Example: Markov Decision Processes
This tutorial was generated using Literate.jl. Download the source as a .jl
file. Download the source as a .ipynb
file.
SDDP.jl
can be used to solve a variety of Markov Decision processes. If the problem has continuous state and control spaces, and the objective and transition function are convex, then SDDP.jl can find a globally optimal policy. In other cases, SDDP.jl will find a locally optimal policy.
A simple example
A simple demonstration of this is the example taken from page 98 of the book "Markov Decision Processes: Discrete stochastic Dynamic Programming", by Martin L. Putterman.
The example, as described in Section 4.6.3 of the book, is to minimize a sum of squares of N
non-negative variables, subject to a budget constraint that the variable values add up to M
. Put mathematically, that is:
\[\begin{aligned} \min \;\; & \sum\limits_{i=1}^N x_i^2 \\ s.t. \;\; & \sum\limits_{i=1}^N x_i = M \\ & x_i \ge 0, \quad i \in 1,\ldots,N \end{aligned}\]
The optimal objective value is $M^2/N$, and the optimal solution is $x_i = M / N$, which can be shown by induction.
This can be reformulated as a Markov Decision Process by introducing a state variable, $s$, which tracks the un-spent budget over $N$ stages.
\[\begin{aligned} V_t(s) = \min \;\; & x^2 + V_{t+1}(s^\prime) \\ s.t. \;\; & s^\prime = s - x \\ & x \le s \\ & x \ge 0 \\ & s \ge 0 \end{aligned}\]
and in the last stage $V_N$, there is an additional constraint that $s^\prime = 0$.
The budget of $M$ is computed by solving for $V_1(M)$.
Since everything here is continuous and convex, SDDP.jl will find the globally optimal policy.
If the reformulation from the single problem into the recursive form of the Markov Decision Process is not obvious, consult Putterman's book.
We can model and solve this problem using SDDP.jl as follows:
using SDDP
import Ipopt
M, N = 5, 3
model = SDDP.LinearPolicyGraph(;
stages = N,
lower_bound = 0.0,
optimizer = Ipopt.Optimizer,
) do subproblem, node
@variable(subproblem, s >= 0, SDDP.State, initial_value = M)
@variable(subproblem, x >= 0)
@stageobjective(subproblem, x^2)
@constraint(subproblem, x <= s.in)
@constraint(subproblem, s.out == s.in - x)
if node == N
fix(s.out, 0.0; force = true)
end
return
end
SDDP.train(model)
-------------------------------------------------------------------
SDDP.jl (c) Oscar Dowson and contributors, 2017-24
-------------------------------------------------------------------
problem
nodes : 3
state variables : 1
scenarios : 1.00000e+00
existing cuts : false
options
solver : serial mode
risk measure : SDDP.Expectation()
sampling scheme : SDDP.InSampleMonteCarlo
subproblem structure
VariableRef : [4, 4]
AffExpr in MOI.EqualTo{Float64} : [1, 1]
AffExpr in MOI.LessThan{Float64} : [1, 1]
VariableRef in MOI.EqualTo{Float64} : [1, 1]
VariableRef in MOI.GreaterThan{Float64} : [2, 3]
VariableRef in MOI.LessThan{Float64} : [1, 1]
numerical stability report
matrix range [1e+00, 1e+00]
objective range [0e+00, 0e+00]
bounds range [0e+00, 0e+00]
rhs range [0e+00, 0e+00]
-------------------------------------------------------------------
iteration simulation bound time (s) solves pid
-------------------------------------------------------------------
1 2.499895e+01 1.562631e+00 1.594400e-02 6 1
40 8.333333e+00 8.333333e+00 6.701250e-01 246 1
-------------------------------------------------------------------
status : simulation_stopping
total time (s) : 6.701250e-01
total solves : 246
best bound : 8.333333e+00
simulation ci : 8.810723e+00 ± 8.167195e-01
numeric issues : 0
-------------------------------------------------------------------
Check that we got the theoretical optimum:
SDDP.calculate_bound(model), M^2 / N
(8.333333297449974, 8.333333333333334)
And check that we found the theoretical value for each $x_i$:
simulations = SDDP.simulate(model, 1, [:x])
for data in simulations[1]
println("x_$(data[:node_index]) = $(data[:x])")
end
x_1 = 1.666665732482549
x_2 = 1.6666672479151778
x_3 = 1.666667013035603
Close enough! We don't get exactly 5/3 because of numerical tolerances within our choice of optimization solver (in this case, Ipopt).
A more complicated policy
SDDP.jl is also capable of finding policies for other types of Markov Decision Processes. A classic example of a Markov Decision Process is the problem of finding a path through a maze.
Here's one example of a maze. Try changing the parameters to explore different mazes:
M, N = 3, 4
initial_square = (1, 1)
reward, illegal_squares, penalties = (3, 4), [(2, 2)], [(3, 1), (2, 4)]
path = fill("⋅", M, N)
path[initial_square...] = "1"
for (k, v) in (illegal_squares => "▩", penalties => "†", [reward] => "*")
for (i, j) in k
path[i, j] = v
end
end
print(join([join(path[i, :], ' ') for i in 1:size(path, 1)], '\n'))
1 ⋅ ⋅ ⋅
⋅ ▩ ⋅ †
† ⋅ ⋅ *
Our goal is to get from square 1
to square *
. If we step on a †
, we incur a penalty of 1
. Squares with ▩
are blocked; we cannot move there.
There are a variety of ways that we can solve this problem. We're going to solve it using a stationary binary stochastic programming formulation.
Our state variable will be a matrix of binary variables $x_{i,j}$, where each element is $1$ if the agent is in the square and $0$ otherwise. In each period, we incur a reward of $1$ if we are in the reward
square and a penalty of $-1$ if we are in a penalties
square. We cannot move to the illegal_squares
, so those $x_{i,j} = 0$. Feasibility between moves is modelled by constraints of the form:
\[x^\prime_{i,j} \le \sum\limits_{(a,b)\in P} x_{a,b}\]
where $P$ is the set of squares from which it is valid to move from (a, b)
to (i, j)
.
Because we are looking for a stationary policy, we need a unicyclic graph with a discount factor:
discount_factor = 0.9
graph = SDDP.UnicyclicGraph(discount_factor)
Root
0
Nodes
1
Arcs
0 => 1 w.p. 1.0
1 => 1 w.p. 0.9
Then we can formulate our full model:
import HiGHS
model = SDDP.PolicyGraph(
graph;
sense = :Max,
upper_bound = 1 / (1 - discount_factor),
optimizer = HiGHS.Optimizer,
) do sp, _
# Our state is a binary variable for each square
@variable(
sp,
x[i = 1:M, j = 1:N],
Bin,
SDDP.State,
initial_value = (i, j) == initial_square,
)
# Can only be in one square at a time
@constraint(sp, sum(x[i, j].out for i in 1:M, j in 1:N) == 1)
# Incur rewards and penalties
@stageobjective(
sp,
x[reward...].out - sum(x[i, j].out for (i, j) in penalties)
)
# Some squares are illegal
@constraint(sp, [(i, j) in illegal_squares], x[i, j].out <= 0)
# Constraints on valid moves
for i in 1:M, j in 1:N
moves = [(i - 1, j), (i + 1, j), (i, j), (i, j + 1), (i, j - 1)]
filter!(v -> 1 <= v[1] <= M && 1 <= v[2] <= N, moves)
@constraint(sp, x[i, j].out <= sum(x[a, b].in for (a, b) in moves))
end
return
end
A policy graph with 1 nodes.
Node indices: 1
The upper bound is obtained by assuming that we reach the reward square in one move and stay there.
Since there are discrete decisions here, SDDP.jl is not guaranteed to find the globally optimal policy.
SDDP.train(model)
-------------------------------------------------------------------
SDDP.jl (c) Oscar Dowson and contributors, 2017-24
-------------------------------------------------------------------
problem
nodes : 1
state variables : 12
scenarios : Inf
existing cuts : false
options
solver : serial mode
risk measure : SDDP.Expectation()
sampling scheme : SDDP.InSampleMonteCarlo
subproblem structure
VariableRef : [25, 25]
AffExpr in MOI.EqualTo{Float64} : [1, 1]
AffExpr in MOI.LessThan{Float64} : [13, 13]
VariableRef in MOI.LessThan{Float64} : [1, 1]
VariableRef in MOI.ZeroOne : [12, 12]
numerical stability report
matrix range [1e+00, 1e+00]
objective range [1e+00, 1e+00]
bounds range [1e+01, 1e+01]
rhs range [1e+00, 1e+00]
-------------------------------------------------------------------
iteration simulation bound time (s) solves pid
-------------------------------------------------------------------
1 0.000000e+00 1.000000e+01 5.419970e-03 15 1
40 0.000000e+00 6.561000e+00 8.708940e-01 2904 1
-------------------------------------------------------------------
status : simulation_stopping
total time (s) : 8.708940e-01
total solves : 2904
best bound : 6.561000e+00
simulation ci : 6.350000e+00 ± 3.746251e+00
numeric issues : 0
-------------------------------------------------------------------
Simulating a cyclic policy graph requires an explicit sampling_scheme
that does not terminate early based on the cycle probability:
simulations = SDDP.simulate(
model,
1,
[:x];
sampling_scheme = SDDP.InSampleMonteCarlo(;
max_depth = 5,
terminate_on_dummy_leaf = false,
),
);
Fill in the path
with the time-step in which we visit the square:
for (t, data) in enumerate(simulations[1]), i in 1:M, j in 1:N
if data[:x][i, j].in > 0.5
path[i, j] = "$t"
end
end
print(join([join(path[i, :], ' ') for i in 1:size(path, 1)], '\n'))
1 2 3 ⋅
⋅ ▩ 4 †
† ⋅ 5 *
This formulation will likely struggle as the number of cells in the maze increases. Can you think of an equivalent formulation that uses fewer state variables?