Example: Markov Decision Processes

SDDP.jl can be used to solve a variety of Markov Decision Processes. If the problem has continuous state and control spaces, and the objective and transition function are convex, then SDDP.jl can find a globally optimal policy. In other cases, SDDP.jl will find a locally optimal policy.

A simple example

A simple demonstration of this is the example taken from page 98 of the book "Markov Decision Processes: Discrete Stochastic Dynamic Programming", by Martin L. Puterman.

The example, as described in Section 4.6.3 of the book, is to minimize a sum of squares of N non-negative variables, subject to a budget constraint that the variable values add up to M. Put mathematically, that is:

\[\begin{aligned} \min \;\; & \sum\limits_{i=1}^N x_i^2 \\ s.t. \;\; & \sum\limits_{i=1}^N x_i = M \\ & x_i \ge 0, \quad i \in 1,\ldots,N \end{aligned}\]

The optimal objective value is $M^2/N$, and the optimal solution is $x_i = M / N$, which can be shown by induction.
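For intuition, here is a short alternative to the induction argument (a sketch, not from the book): the problem is convex and the non-negativity constraints are inactive at the optimum, so the Lagrangian stationarity conditions give

\[\begin{aligned} \mathcal{L}(x, \lambda) & = \sum\limits_{i=1}^N x_i^2 - \lambda \left( \sum\limits_{i=1}^N x_i - M \right) \\ \frac{\partial \mathcal{L}}{\partial x_i} = 2 x_i - \lambda & = 0, \quad i \in 1,\ldots,N \end{aligned}\]

so every $x_i$ equals $\lambda / 2$; the budget constraint then forces $x_i = M / N$, with objective value $N (M/N)^2 = M^2/N$.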

This can be reformulated as a Markov Decision Process by introducing a state variable, $s$, which tracks the un-spent budget over $N$ stages.

\[\begin{aligned} V_t(s) = \min \;\; & x^2 + V_{t+1}(s^\prime) \\ s.t. \;\; & s^\prime = s - x \\ & x \le s \\ & x \ge 0 \\ & s \ge 0 \end{aligned}\]

and in the last stage $V_N$, there is an additional constraint that $s^\prime = 0$.

The value of the problem with a budget of $M$ is computed by solving for $V_1(M)$.
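As a quick check on the recursion (not part of the original example), backward induction gives the closed form $V_t(s) = s^2 / (N - t + 1)$: the constraint $s^\prime = 0$ in the last stage forces $x = s$, so $V_N(s) = s^2$, and if $V_{t+1}(s^\prime) = (s^\prime)^2 / (N - t)$, then

\[V_t(s) = \min\limits_{0 \le x \le s} \; x^2 + \frac{(s - x)^2}{N - t} = \frac{s^2}{N - t + 1}, \quad \text{attained at} \; x = \frac{s}{N - t + 1}.\]

In particular, $V_1(M) = M^2 / N$, which matches the answer above.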

Info

Since everything here is continuous and convex, SDDP.jl will find the globally optimal policy.

If the reformulation from the single problem into the recursive form of the Markov Decision Process is not obvious, consult Puterman's book.

We can model and solve this problem using SDDP.jl as follows:

using SDDP
import Ipopt

M, N = 5, 3

model = SDDP.LinearPolicyGraph(
    stages = N,
    lower_bound = 0.0,
    optimizer = Ipopt.Optimizer,
) do subproblem, node
    # The state is the unspent budget, which starts at M
    @variable(subproblem, s >= 0, SDDP.State, initial_value = M)
    # x is the amount spent in this stage
    @variable(subproblem, x >= 0)
    @stageobjective(subproblem, x^2)
    # We cannot spend more than the remaining budget
    @constraint(subproblem, x <= s.in)
    @constraint(subproblem, s.out == s.in - x)
    # In the last stage, the budget must be fully spent
    if node == N
        fix(s.out, 0.0; force = true)
    end
    return
end

SDDP.train(model; stopping_rules = [SDDP.BoundStalling(5, 1e-6)])
------------------------------------------------------------------------------
          SDDP.jl (c) Oscar Dowson and SDDP.jl contributors, 2017-23

Problem
  Nodes           : 3
  State variables : 1
  Scenarios       : 1.00000e+00
  Existing cuts   : false
  Subproblem structure                      : (min, max)
    Variables                               : (4, 4)
    VariableRef in MOI.EqualTo{Float64}     : (1, 1)
    VariableRef in MOI.LessThan{Float64}    : (1, 1)
    AffExpr in MOI.LessThan{Float64}        : (1, 1)
    VariableRef in MOI.GreaterThan{Float64} : (2, 3)
    AffExpr in MOI.EqualTo{Float64}         : (1, 1)
Options
  Solver          : serial mode
  Risk measure    : SDDP.Expectation()
  Sampling scheme : SDDP.InSampleMonteCarlo

Numerical stability report
  Non-zero Matrix range     [1e+00, 1e+00]
  Non-zero Objective range  [0e+00, 0e+00]
  Non-zero Bounds range     [0e+00, 0e+00]
  Non-zero RHS range        [0e+00, 0e+00]
No problems detected

 Iteration    Simulation       Bound         Time (s)    Proc. ID   # Solves
        1    2.499895e+01   1.562631e+00   2.701592e-02          1          6
        2    9.374869e+00   6.250337e+00   5.956388e-02          1         12
        3    9.374777e+00   7.812434e+00   8.685684e-02          1         18
        4    8.593609e+00   8.203174e+00   1.168809e-01          1         24
        5    8.398407e+00   8.300761e+00   1.474738e-01          1         30
        6    8.349577e+00   8.325206e+00   1.795769e-01          1         36
        7    8.337394e+00   8.331293e+00   2.114289e-01          1         42
        8    8.334343e+00   8.332827e+00   2.456009e-01          1         48
        9    8.333586e+00   8.333205e+00   2.803500e-01          1         54
       10    8.333395e+00   8.333302e+00   3.138540e-01          1         60
       11    8.333349e+00   8.333325e+00   3.481760e-01          1         66
       12    8.333337e+00   8.333331e+00   3.822370e-01          1         72
       13    8.333334e+00   8.333333e+00   4.157438e-01          1         78
       14    8.333333e+00   8.333333e+00   4.496050e-01          1         84
       15    8.333333e+00   8.333333e+00   4.837790e-01          1         90
       16    8.333333e+00   8.333333e+00   5.184879e-01          1         96
       17    8.333333e+00   8.333333e+00   5.541210e-01          1        102
       18    8.333333e+00   8.333333e+00   5.903649e-01          1        108

Terminating training
  Status         : bound_stalling
  Total time (s) : 5.903649e-01
  Total solves   : 108
  Best bound     :  8.333333e+00
  Simulation CI  :  9.394200e+00 ± 1.805733e+00
------------------------------------------------------------------------------

Check that we got the theoretical optimum:

SDDP.calculate_bound(model), M^2 / N
(8.333333240437835, 8.333333333333334)

And check that we recover the theoretical value for each $x_i$:

simulations = SDDP.simulate(model, 1, [:x])
for data in simulations[1]
    println("x_$(data[:node_index]) = $(data[:x])")
end
x_1 = 1.6666691570800714
x_2 = 1.6666657446246773
x_3 = 1.6666650917286001

Close enough! We don't get exactly 5/3 because of numerical tolerances within our choice of optimization solver (in this case, Ipopt).
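If you need tighter agreement, one option (a sketch, not something the original example does) is to shrink Ipopt's overall convergence tolerance, tol, and pass the resulting factory as the optimizer argument of SDDP.LinearPolicyGraph above:

import Ipopt
import JuMP

# Hypothetical variant: an Ipopt factory with a tighter convergence tolerance.
# Use it by passing optimizer = tight_ipopt in the model definition above.
tight_ipopt = JuMP.optimizer_with_attributes(Ipopt.Optimizer, "tol" => 1e-12)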

A more complicated policy

SDDP.jl is also capable of finding policies for other types of Markov Decision Processes. A classic example of a Markov Decision Process is the problem of finding a path through a maze.

Here's one example of a maze. Try changing the parameters to explore different mazes:

M, N = 3, 4
initial_square = (1, 1)
reward, illegal_squares, penalties = (3, 4), [(2, 2)], [(3, 1), (2, 4)]
path = fill("⋅", M, N)
path[initial_square...] = "1"
for (k, v) in (illegal_squares => "▩", penalties => "†", [reward] => "*")
    for (i, j) in k
        path[i, j] = v
    end
end
print(join([join(path[i, :], ' ') for i in 1:size(path, 1)], '\n'))
1 ⋅ ⋅ ⋅
⋅ ▩ ⋅ †
† ⋅ ⋅ *

Our goal is to get from square 1 to square *. If we step on a †, we incur a penalty of 1. Squares with ▩ are blocked; we cannot move there.

There are a variety of ways that we can solve this problem. We're going to solve it using a stationary binary stochastic programming formulation.

Our state variable will be a matrix of binary variables $x_{i,j}$, where each element is $1$ if the agent is in square $(i, j)$ and $0$ otherwise. In each period, we earn a reward of $1$ if we are in the reward square, and a reward of $-1$ (a penalty) for each penalties square we are in. We cannot move to the illegal_squares, so those $x_{i,j} = 0$. Feasibility between moves is modelled by constraints of the form:

\[x^\prime_{i,j} \le \sum\limits_{(a,b)\in P} x_{a,b}\]

where $P$ is the set of squares $(a, b)$ from which it is valid to move to square $(i, j)$; as in the code below, the agent may also stay put, so $(i, j)$ itself is included in $P$.
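For example, the top-left corner square $(1, 1)$ can be reached only from itself, from $(1, 2)$, or from $(2, 1)$, so its constraint is:

\[x^\prime_{1,1} \le x_{1,1} + x_{1,2} + x_{2,1}\]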

Because we are looking for a stationary policy, we need a unicyclic graph with a discount factor:

discount_factor = 0.9
graph = SDDP.UnicyclicGraph(discount_factor)
Root
 0
Nodes
 1
Arcs
 0 => 1 w.p. 1.0
 1 => 1 w.p. 0.9

Then we can formulate our full model:

import HiGHS

model = SDDP.PolicyGraph(
    graph;
    sense = :Max,
    upper_bound = 1 / (1 - discount_factor),
    optimizer = HiGHS.Optimizer,
) do sp, _
    # Our state is a binary variable for each square
    @variable(
        sp,
        x[i = 1:M, j = 1:N],
        Bin,
        SDDP.State,
        initial_value = (i, j) == initial_square,
    )
    # Can only be in one square at a time
    @constraint(sp, sum(x[i, j].out for i in 1:M, j in 1:N) == 1)
    # Incur rewards and penalties
    @stageobjective(
        sp,
        x[reward...].out - sum(x[i, j].out for (i, j) in penalties)
    )
    # Some squares are illegal
    @constraint(sp, [(i, j) in illegal_squares], x[i, j].out <= 0)
    # Constraints on valid moves
    for i in 1:M, j in 1:N
        moves = [(i - 1, j), (i + 1, j), (i, j), (i, j + 1), (i, j - 1)]
        filter!(v -> 1 <= v[1] <= M && 1 <= v[2] <= N, moves)
        @constraint(sp, x[i, j].out <= sum(x[a, b].in for (a, b) in moves))
    end
    return
end
A policy graph with 1 nodes.
 Node indices: 1

The upper bound is obtained by assuming that we reach the reward square in one move and stay there.
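Staying on the reward square forever earns a discounted reward of at most

\[\sum\limits_{t=0}^{\infty} 0.9^t \times 1 = \frac{1}{1 - 0.9} = 10,\]

which is the upper_bound = 1 / (1 - discount_factor) passed to SDDP.PolicyGraph.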

Warning

Since there are discrete decisions here, SDDP.jl is not guaranteed to find the globally optimal policy.

SDDP.train(model; stopping_rules = [SDDP.BoundStalling(5, 1e-6)])
------------------------------------------------------------------------------
          SDDP.jl (c) Oscar Dowson and SDDP.jl contributors, 2017-23

Problem
  Nodes           : 1
  State variables : 12
  Scenarios       : Inf
  Existing cuts   : false
  Subproblem structure                   : (min, max)
    Variables                            : (25, 25)
    VariableRef in MOI.LessThan{Float64} : (1, 1)
    VariableRef in MOI.ZeroOne           : (12, 12)
    AffExpr in MOI.LessThan{Float64}     : (13, 13)
    AffExpr in MOI.EqualTo{Float64}      : (1, 1)
Options
  Solver          : serial mode
  Risk measure    : SDDP.Expectation()
  Sampling scheme : SDDP.InSampleMonteCarlo

Numerical stability report
  Non-zero Matrix range     [1e+00, 1e+00]
  Non-zero Objective range  [1e+00, 1e+00]
  Non-zero Bounds range     [1e+01, 1e+01]
  Non-zero RHS range        [1e+00, 1e+00]
No problems detected

 Iteration    Simulation       Bound         Time (s)    Proc. ID   # Solves
        1    0.000000e+00   9.000000e+00   2.547979e-03          1          3
        2    0.000000e+00   9.958140e+00   7.233858e-03          1         18
        3    0.000000e+00   7.217100e+00   9.635925e-03          1         23
        4    0.000000e+00   7.217100e+00   1.269197e-02          1         30
        5    0.000000e+00   7.217100e+00   3.873396e-02          1         85
        6    0.000000e+00   7.217100e+00   4.134989e-02          1         90
        7    8.000000e+00   6.843430e+00   4.980993e-02          1        115
        8    0.000000e+00   6.815187e+00   5.291486e-02          1        122
        9    3.000000e+00   6.727772e+00   5.866194e-02          1        137
       10    8.000000e+00   6.619150e+00   6.673288e-02          1        162
       11    5.000000e+00   6.588813e+00   7.329392e-02          1        181
       12    2.400000e+01   6.562797e+00   9.049892e-02          1        238
       13    3.300000e+01   6.561045e+00   1.135840e-01          1        313
       14    3.000000e+00   6.561027e+00   1.194699e-01          1        328
       15    0.000000e+00   6.561022e+00   1.233950e-01          1        337
       16    0.000000e+00   6.561019e+00   1.258390e-01          1        342
       17    1.000000e+00   6.561016e+00   1.306009e-01          1        353
       18    8.000000e+00   6.561005e+00   1.396599e-01          1        378
       19    6.000000e+00   6.561002e+00   1.475379e-01          1        399
       20    0.000000e+00   6.561002e+00   1.494470e-01          1        402
       21    2.000000e+00   6.561002e+00   1.550219e-01          1        415
       22    3.000000e+00   6.561001e+00   1.610320e-01          1        430
       23    0.000000e+00   6.561001e+00   1.716280e-01          1        451
       24    0.000000e+00   6.561001e+00   1.748269e-01          1        458

Terminating training
  Status         : bound_stalling
  Total time (s) : 1.748269e-01
  Total solves   : 458
  Best bound     :  6.561001e+00
  Simulation CI  :  4.333333e+00 ± 3.230246e+00
------------------------------------------------------------------------------

Simulating a cyclic policy graph requires an explicit sampling_scheme that does not terminate early based on the cycle probability:

simulations = SDDP.simulate(
    model,
    1,
    [:x];
    sampling_scheme = SDDP.InSampleMonteCarlo(
        max_depth = 5,
        terminate_on_dummy_leaf = false,
    ),
);
1-element Vector{Vector{Dict{Symbol, Any}}}:
 [Dict(:bellman_term => 6.561000820831015, :noise_term => nothing, :node_index => 1, :stage_objective => 0.0, :objective_state => nothing, :belief => Dict(1 => 1.0), :x => SDDP.State{Float64}[SDDP.State{Float64}(1.0, 0.0) SDDP.State{Float64}(0.0, 1.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0)]), Dict(:bellman_term => 7.29000091203446, :noise_term => nothing, :node_index => 1, :stage_objective => 0.0, :objective_state => nothing, :belief => Dict(1 => 1.0), :x => SDDP.State{Float64}[SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(1.0, 0.0) SDDP.State{Float64}(0.0, 1.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0)]), Dict(:bellman_term => 8.100001013371623, :noise_term => nothing, :node_index => 1, :stage_objective => 0.0, :objective_state => nothing, :belief => Dict(1 => 1.0), :x => SDDP.State{Float64}[SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(1.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 1.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0)]), Dict(:bellman_term => 9.000001125968469, :noise_term => nothing, :node_index => 1, :stage_objective => 0.0, :objective_state => nothing, :belief => Dict(1 => 1.0), :x => SDDP.State{Float64}[SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(1.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 1.0) SDDP.State{Float64}(0.0, 0.0)]), Dict(:bellman_term => 9.000001125968469, :noise_term => nothing, :node_index => 1, :stage_objective => 1.0, :objective_state => nothing, :belief => Dict(1 => 1.0), :x => SDDP.State{Float64}[SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0); SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(0.0, 0.0) SDDP.State{Float64}(1.0, 0.0) SDDP.State{Float64}(0.0, 1.0)])]

Fill in the path with the time-step in which we visit the square:

for (t, data) in enumerate(simulations[1]), i in 1:M, j in 1:N
    if data[:x][i, j].in > 0.5
        path[i, j] = "$t"
    end
end

print(join([join(path[i, :], ' ') for i in 1:size(path, 1)], '\n'))
1 2 3 ⋅
⋅ ▩ 4 †
† ⋅ 5 *
Tip

This formulation will likely struggle as the number of cells in the maze increases. Can you think of an equivalent formulation that uses fewer state variables?


This page was generated using Literate.jl.