Example: Markov Decision Processes
SDDP.jl can be used to solve a variety of Markov Decision Processes. If the problem has continuous state and control spaces, and the objective and transition functions are convex, then SDDP.jl can find a globally optimal policy. In other cases, SDDP.jl will find a locally optimal policy.
A simple example
A simple demonstration of this is the example taken from page 98 of the book "Markov Decision Processes: Discrete Stochastic Dynamic Programming" by Martin L. Puterman.
The example, as described in Section 4.6.3 of the book, is to minimize a sum of squares of N non-negative variables, subject to a budget constraint that the variable values add up to M. Put mathematically, that is:
\[\begin{aligned} \min \;\; & \sum\limits_{i=1}^N x_i^2 \\ s.t. \;\; & \sum\limits_{i=1}^N x_i = M \\ & x_i \ge 0, \quad i \in 1,\ldots,N \end{aligned}\]
The optimal objective value is $M^2/N$, and the optimal solution is $x_i = M / N$, which can be shown by induction.
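As a quick sanity check (a pure-Julia sketch; the `objective` helper is ours, not part of the tutorial's model), the claimed optimum can be compared against random feasible points:

```julia
# Hypothetical helper for checking the claimed optimum x_i = M / N.
M, N = 5, 3
objective(x) = sum(xi^2 for xi in x)
x_star = fill(M / N, N)          # claimed optimal solution
@assert objective(x_star) ≈ M^2 / N
# No random feasible point (x .>= 0, sum(x) == M) should do better:
for _ in 1:1_000
    w = rand(N)
    x = M .* w ./ sum(w)
    @assert objective(x) >= objective(x_star) - 1e-8
end
```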
This can be reformulated as a Markov Decision Process by introducing a state variable, $s$, which tracks the un-spent budget over $N$ stages.
\[\begin{aligned} V_t(s) = \min \;\; & x^2 + V_{t+1}(s^\prime) \\ s.t. \;\; & s^\prime = s - x \\ & x \le s \\ & x \ge 0 \\ & s \ge 0 \end{aligned}\]
and in the last stage $V_N$, there is an additional constraint that $s^\prime = 0$.
The value of the problem with budget $M$ is computed by solving for $V_1(M)$.
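To see that the recursion reproduces the answer, here is a pure-Julia brute-force check by backward induction on a discretized budget grid (the `dp_value` helper and the grid resolution `K` are our own illustrative choices, not part of SDDP.jl):

```julia
function dp_value(M, N; K = 100)
    # Stage N: the constraint s' = 0 forces x = s, so V_N(s) = s^2.
    V = [(M * k / K)^2 for k in 0:K]
    # Backward induction: V_t(s) = min_{0 <= x <= s} x^2 + V_{t+1}(s - x),
    # with s and x restricted to the grid {M * k / K : k = 0, ..., K}.
    for t in (N-1):-1:1
        V = [minimum((M * j / K)^2 + V[k-j+1] for j in 0:k) for k in 0:K]
    end
    return V[end]    # V_1(M)
end
dp_value(5, 3)       # ≈ 8.335, close to M^2 / N = 8.333…
```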
Since everything here is continuous and convex, SDDP.jl will find the globally optimal policy.
If the reformulation from the single problem into the recursive form of the Markov Decision Process is not obvious, consult Puterman's book.
We can model and solve this problem using SDDP.jl as follows:
using SDDP
import Ipopt
M, N = 5, 3
model = SDDP.LinearPolicyGraph(
    stages = N,
    lower_bound = 0.0,
    optimizer = Ipopt.Optimizer,
) do subproblem, node
    @variable(subproblem, s >= 0, SDDP.State, initial_value = M)
    @variable(subproblem, x >= 0)
    @stageobjective(subproblem, x^2)
    @constraint(subproblem, x <= s.in)
    @constraint(subproblem, s.out == s.in - x)
    if node == N
        fix(s.out, 0.0; force = true)
    end
    return
end
SDDP.train(model; stopping_rules = [SDDP.BoundStalling(5, 1e-6)])
------------------------------------------------------------------------------
SDDP.jl (c) Oscar Dowson and SDDP.jl contributors, 2017-23
Problem
Nodes : 3
State variables : 1
Scenarios : 1.00000e+00
Existing cuts : false
Subproblem structure : (min, max)
Variables : (4, 4)
VariableRef in MOI.EqualTo{Float64} : (1, 1)
VariableRef in MOI.LessThan{Float64} : (1, 1)
AffExpr in MOI.LessThan{Float64} : (1, 1)
VariableRef in MOI.GreaterThan{Float64} : (2, 3)
AffExpr in MOI.EqualTo{Float64} : (1, 1)
Options
Solver : serial mode
Risk measure : SDDP.Expectation()
Sampling scheme : SDDP.InSampleMonteCarlo
Numerical stability report
Non-zero Matrix range [1e+00, 1e+00]
Non-zero Objective range [0e+00, 0e+00]
Non-zero Bounds range [0e+00, 0e+00]
Non-zero RHS range [0e+00, 0e+00]
No problems detected
Iteration Simulation Bound Time (s) Proc. ID # Solves
1 2.499895e+01 1.562631e+00 2.701592e-02 1 6
2 9.374869e+00 6.250337e+00 5.956388e-02 1 12
3 9.374777e+00 7.812434e+00 8.685684e-02 1 18
4 8.593609e+00 8.203174e+00 1.168809e-01 1 24
5 8.398407e+00 8.300761e+00 1.474738e-01 1 30
6 8.349577e+00 8.325206e+00 1.795769e-01 1 36
7 8.337394e+00 8.331293e+00 2.114289e-01 1 42
8 8.334343e+00 8.332827e+00 2.456009e-01 1 48
9 8.333586e+00 8.333205e+00 2.803500e-01 1 54
10 8.333395e+00 8.333302e+00 3.138540e-01 1 60
11 8.333349e+00 8.333325e+00 3.481760e-01 1 66
12 8.333337e+00 8.333331e+00 3.822370e-01 1 72
13 8.333334e+00 8.333333e+00 4.157438e-01 1 78
14 8.333333e+00 8.333333e+00 4.496050e-01 1 84
15 8.333333e+00 8.333333e+00 4.837790e-01 1 90
16 8.333333e+00 8.333333e+00 5.184879e-01 1 96
17 8.333333e+00 8.333333e+00 5.541210e-01 1 102
18 8.333333e+00 8.333333e+00 5.903649e-01 1 108
Terminating training
Status : bound_stalling
Total time (s) : 5.903649e-01
Total solves : 108
Best bound : 8.333333e+00
Simulation CI : 9.394200e+00 ± 1.805733e+00
------------------------------------------------------------------------------
Check that we got the theoretical optimum:
SDDP.calculate_bound(model), M^2 / N
(8.333333240437835, 8.333333333333334)
And check that we found the theoretical value for each $x_i$:
simulations = SDDP.simulate(model, 1, [:x])
for data in simulations[1]
    println("x_$(data[:node_index]) = $(data[:x])")
end
x_1 = 1.6666691570800714
x_2 = 1.6666657446246773
x_3 = 1.6666650917286001
Close enough! We don't get exactly 5/3 because of numerical tolerances within our choice of optimization solver (in this case, Ipopt).
A more complicated policy
SDDP.jl is also capable of finding policies for other types of Markov Decision Processes. A classic example of a Markov Decision Process is the problem of finding a path through a maze.
Here's one example of a maze. Try changing the parameters to explore different mazes:
M, N = 3, 4
initial_square = (1, 1)
reward, illegal_squares, penalties = (3, 4), [(2, 2)], [(3, 1), (2, 4)]
path = fill("⋅", M, N)
path[initial_square...] = "1"
for (k, v) in (illegal_squares => "▩", penalties => "†", [reward] => "*")
    for (i, j) in k
        path[i, j] = v
    end
end
print(join([join(path[i, :], ' ') for i in 1:size(path, 1)], '\n'))
1 ⋅ ⋅ ⋅
⋅ ▩ ⋅ †
† ⋅ ⋅ *
Our goal is to get from square 1 to square *. If we step on a †, we incur a penalty of 1. Squares with ▩ are blocked; we cannot move there.
There are a variety of ways that we can solve this problem. We're going to solve it using a stationary binary stochastic programming formulation.
Our state variable will be a matrix of binary variables $x_{i,j}$, where each element is $1$ if the agent is in square $(i,j)$ and $0$ otherwise. In each period, we earn a reward of $1$ if we are in the reward square and incur a penalty of $1$ if we are in a penalty square. We cannot move to the illegal_squares, so those $x_{i,j} = 0$. Feasibility between moves is modelled by constraints of the form:
\[x^\prime_{i,j} \le \sum\limits_{(a,b)\in P} x_{a,b}\]
where $P$ is the set of squares $(a,b)$ from which it is valid to move to $(i,j)$.
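For intuition, each set $P$ can be enumerated by filtering out-of-grid neighbours; here is a pure-Julia sketch (the helper name `P` is ours) for the $3 \times 4$ maze above:

```julia
M, N = 3, 4
# P(i, j): squares from which the agent can arrive at (i, j) in one period,
# i.e. stay put or step in from an adjacent in-grid square. (Illegal squares
# are handled separately by explicit constraints in the model.)
P(i, j) = filter(
    v -> 1 <= v[1] <= M && 1 <= v[2] <= N,
    [(i - 1, j), (i + 1, j), (i, j), (i, j + 1), (i, j - 1)],
)
P(1, 1)  # corner square: [(2, 1), (1, 1), (1, 2)]
```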
Because we are looking for a stationary policy, we need a unicyclic graph with a discount factor:
discount_factor = 0.9
graph = SDDP.UnicyclicGraph(discount_factor)
Root
0
Nodes
1
Arcs
0 => 1 w.p. 1.0
1 => 1 w.p. 0.9
Then we can formulate our full model:
import HiGHS
model = SDDP.PolicyGraph(
    graph;
    sense = :Max,
    upper_bound = 1 / (1 - discount_factor),
    optimizer = HiGHS.Optimizer,
) do sp, _
    # Our state is a binary variable for each square
    @variable(
        sp,
        x[i = 1:M, j = 1:N],
        Bin,
        SDDP.State,
        initial_value = (i, j) == initial_square,
    )
    # Can only be in one square at a time
    @constraint(sp, sum(x[i, j].out for i in 1:M, j in 1:N) == 1)
    # Incur rewards and penalties
    @stageobjective(
        sp,
        x[reward...].out - sum(x[i, j].out for (i, j) in penalties)
    )
    # Some squares are illegal
    @constraint(sp, [(i, j) in illegal_squares], x[i, j].out <= 0)
    # Constraints on valid moves
    for i in 1:M, j in 1:N
        moves = [(i - 1, j), (i + 1, j), (i, j), (i, j + 1), (i, j - 1)]
        filter!(v -> 1 <= v[1] <= M && 1 <= v[2] <= N, moves)
        @constraint(sp, x[i, j].out <= sum(x[a, b].in for (a, b) in moves))
    end
    return
end
A policy graph with 1 nodes.
Node indices: 1
The upper bound is obtained by assuming that we reach the reward square in one move and stay there.
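That bound is the geometric series $1 + \gamma + \gamma^2 + \cdots = 1/(1-\gamma)$; a short pure-Julia check:

```julia
discount_factor = 0.9
# Earning the maximum reward of 1 in every period gives the discounted total
# 1 + γ + γ² + ⋯; a truncated sum confirms it equals 1 / (1 - γ) = 10.
geometric_sum = sum(discount_factor^t for t in 0:1_000)
@assert geometric_sum ≈ 1 / (1 - discount_factor)
```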
Since there are discrete decisions here, SDDP.jl is not guaranteed to find the globally optimal policy.
SDDP.train(model; stopping_rules = [SDDP.BoundStalling(5, 1e-6)])
------------------------------------------------------------------------------
SDDP.jl (c) Oscar Dowson and SDDP.jl contributors, 2017-23
Problem
Nodes : 1
State variables : 12
Scenarios : Inf
Existing cuts : false
Subproblem structure : (min, max)
Variables : (25, 25)
VariableRef in MOI.LessThan{Float64} : (1, 1)
VariableRef in MOI.ZeroOne : (12, 12)
AffExpr in MOI.LessThan{Float64} : (13, 13)
AffExpr in MOI.EqualTo{Float64} : (1, 1)
Options
Solver : serial mode
Risk measure : SDDP.Expectation()
Sampling scheme : SDDP.InSampleMonteCarlo
Numerical stability report
Non-zero Matrix range [1e+00, 1e+00]
Non-zero Objective range [1e+00, 1e+00]
Non-zero Bounds range [1e+01, 1e+01]
Non-zero RHS range [1e+00, 1e+00]
No problems detected
Iteration Simulation Bound Time (s) Proc. ID # Solves
1 0.000000e+00 9.000000e+00 2.547979e-03 1 3
2 0.000000e+00 9.958140e+00 7.233858e-03 1 18
3 0.000000e+00 7.217100e+00 9.635925e-03 1 23
4 0.000000e+00 7.217100e+00 1.269197e-02 1 30
5 0.000000e+00 7.217100e+00 3.873396e-02 1 85
6 0.000000e+00 7.217100e+00 4.134989e-02 1 90
7 8.000000e+00 6.843430e+00 4.980993e-02 1 115
8 0.000000e+00 6.815187e+00 5.291486e-02 1 122
9 3.000000e+00 6.727772e+00 5.866194e-02 1 137
10 8.000000e+00 6.619150e+00 6.673288e-02 1 162
11 5.000000e+00 6.588813e+00 7.329392e-02 1 181
12 2.400000e+01 6.562797e+00 9.049892e-02 1 238
13 3.300000e+01 6.561045e+00 1.135840e-01 1 313
14 3.000000e+00 6.561027e+00 1.194699e-01 1 328
15 0.000000e+00 6.561022e+00 1.233950e-01 1 337
16 0.000000e+00 6.561019e+00 1.258390e-01 1 342
17 1.000000e+00 6.561016e+00 1.306009e-01 1 353
18 8.000000e+00 6.561005e+00 1.396599e-01 1 378
19 6.000000e+00 6.561002e+00 1.475379e-01 1 399
20 0.000000e+00 6.561002e+00 1.494470e-01 1 402
21 2.000000e+00 6.561002e+00 1.550219e-01 1 415
22 3.000000e+00 6.561001e+00 1.610320e-01 1 430
23 0.000000e+00 6.561001e+00 1.716280e-01 1 451
24 0.000000e+00 6.561001e+00 1.748269e-01 1 458
Terminating training
Status : bound_stalling
Total time (s) : 1.748269e-01
Total solves : 458
Best bound : 6.561001e+00
Simulation CI : 4.333333e+00 ± 3.230246e+00
------------------------------------------------------------------------------
Simulating a cyclic policy graph requires an explicit sampling_scheme that does not terminate early based on the cycle probability:
simulations = SDDP.simulate(
    model,
    1,
    [:x];
    sampling_scheme = SDDP.InSampleMonteCarlo(
        max_depth = 5,
        terminate_on_dummy_leaf = false,
    ),
);
1-element Vector{Vector{Dict{Symbol, Any}}}:
[Dict(:bellman_term => 6.561000820831015, :noise_term => nothing, :node_index => 1, :stage_objective => 0.0, :objective_state => nothing, :belief => Dict(1 => 1.0), :x => SDDP.State{Float64}[SDDP.State{Float64}(1.0, 0.0) SDDP.State{Float64}(0.0, 1.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0)]), Dict(:bellman_term => 7.29000091203446, :noise_term => nothing, :node_index => 1, :stage_objective => 0.0, :objective_state => nothing, :belief => Dict(1 => 1.0), :x => SDDP.State{Float64}[SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(1.0, 0.0) SDDP.State{Float64}(0.0, 1.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0)]), Dict(:bellman_term => 8.100001013371623, :noise_term => nothing, :node_index => 1, :stage_objective => 0.0, :objective_state => nothing, :belief => Dict(1 => 1.0), :x => SDDP.State{Float64}[SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(1.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 1.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0)]), Dict(:bellman_term => 9.000001125968469, :noise_term => nothing, :node_index => 1, :stage_objective => 0.0, :objective_state => nothing, :belief => Dict(1 => 1.0), :x => SDDP.State{Float64}[SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); 
SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(1.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 1.0) SDDP.State{Float64}(0.0, 0.0)]), Dict(:bellman_term => 9.000001125968469, :noise_term => nothing, :node_index => 1, :stage_objective => 1.0, :objective_state => nothing, :belief => Dict(1 => 1.0), :x => SDDP.State{Float64}[SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(1.0, 0.0) SDDP.State{Float64}(0.0, 1.0)])]
Fill in the path with the time-step in which we visit the square:
for (t, data) in enumerate(simulations[1]), i in 1:M, j in 1:N
    if data[:x][i, j].in > 0.5
        path[i, j] = "$t"
    end
end
print(join([join(path[i, :], ' ') for i in 1:size(path, 1)], '\n'))
1 2 3 ⋅
⋅ ▩ 4 †
† ⋅ 5 *
This formulation will likely struggle as the number of cells in the maze increases. Can you think of an equivalent formulation that uses fewer state variables?
This page was generated using Literate.jl.