Expectation of multidimensional value function with discrete states

Hello,

I have a dynamic programming problem in which I need to calculate the conditional expectation of the next period value function of my problem.

The expectation is conditional on my current states and action, and is calculated integrating over the next-period states. I assume that the states follow a first order Markov process.

The problem itself is very simple but it can blow up really quick due to the combinatorial nature of the problem.

I have two actions, and four discrete states.

using Random, Distributions, BenchmarkTools

# dimensions
da = 2         # actions
dk = 5         # capital
dx = 10
dz = 15
du = 15

I generate dense transition matrices. Note that one of the transition matrices depends on the action.
Where F_k[a][i,j] = \Pr(k^\prime=j|k=i,a) and the same goes for other states.

# transition matrices
Fk = [rand(dk,dk) for a=1:da]
Fx = rand(dx,dx)
Fz = rand(dz,dz)
Fu = rand(du,du)

# normalization
Fk = [Fk[a]./= sum(Fk[a],dims=2) for a=1:da]
Fx ./= sum(Fx,dims=2)
Fz ./= sum(Fz,dims=2)
Fu ./= sum(Fu,dims=2)

I generate some random values for the value functions.

# value functions
V = rand(dk,dx,dz,du)

# expectation of V conditional on current period action and states
EV = zeros(da,dk,dx,dz,du)

Function that calculates the expectation over future states.

function expectation(a,k,x,z,u,V,Fk,Fx,Fz,Fu)
    FV = 0.0
    for ik=1:dk
        for ix=1:dx
            for iz=1:dz
                for iu=1:du
                    FV += Fk[a][k,ik] * Fx[x,ix] * Fz[z,iz] * Fu[u,iu] * V[ik,ix,iz,iu]
                end
            end
        end
    end
    return FV
end

I calculate the conditional expectations as follows.

# the expectation takes very long
@btime for ia=1:da
    for ik=1:dk
        for ix=1:dx
            for iz=1:dz
                for iu=1:du
                    EV[ia,ik,ix,iz,iu] += expectation(ia,ik,ix,iz,iu,V,Fk,Fx,Fz,Fu)
                end
            end
        end
    end
end

It’s clearly a very inefficient way to get these values. The memory use is high.

78.692 s (2061430725 allocations: 35.03 GiB)

I am sure that the performance tips for the for loops will help a lot, but I wonder if there is a more intelligent and elegant way to do access the data and perform the multiplications given the dimensionality of the problem.

Feel free to rearrange the problem (transition matrices, EV array, etc.) as you wish.

Thank you all.

1 Like

Don’t benchmark in global scope. (Read the performance tips.) You should have zero allocations once you put this code into a function.

PS. Your code is not runnable, so I guess you have some typo.

PPS. You don’t need all the argument-type declarations. Declaring function expectation(a,k,x,z,u,V,Fk,Fx,Fz,Fu) will be just as fast.

1 Like

To make this a bit more concrete, if you fix the typo in the type signature of expectation (i.e, remove all the type declarations), and declare the dimensions as const, then the run time goes down to 1.2s.

# dimensions
const da = 2         # actions
const dk = 5         # capital
const dx = 10
const dz = 15
const du = 15

# transition matrices
Fk = [rand(dk,dk) for a=1:da]
Fx = rand(dx,dx)
Fz = rand(dz,dz)
Fu = rand(du,du)

# normalization
Fk = [Fk[a]./= sum(Fk[a],dims=2) for a=1:da]
Fx ./= sum(Fx,dims=2)
Fz ./= sum(Fz,dims=2)
Fu ./= sum(Fu,dims=2)

# value functions
V = rand(dk,dx,dz,du)

# expectation of V conditional on current period action and states
EV = zeros(da,dk,dx,dz,du)

function expectation( a, k, x, z, u, V, Fk, Fx, Fz, Fu)
    FV = 0.0
    for ik=1:dk, ix=1:dx, iz=1:dz, iu=1:du
        FV += Fk[a][k,ik] * Fx[x,ix] * Fz[z,iz] * Fu[u,iu] * V[ik,ix,iz,iu]
    end
    return FV
end

function compute_EV!(EV, V, Fk, Fx, Fz, Fu)
    for ia=1:da, ik=1:dk, ix=1:dx, iz=1:dz, iu=1:du
        EV[ia,ik,ix,iz,iu] += expectation(ia,ik,ix,iz,iu,V,Fk,Fx,Fz,Fu)
    end
    return EV
end

@benchmark compute_EV!($EV, $V, $Fk, $Fx, $Fz, $Fu)
#BenchmarkTools.Trial: 
#  memory estimate:  0 bytes
#  allocs estimate:  0
#  --------------
#  minimum time:     1.234 s (0.00% GC)
#  median time:      1.235 s (0.00% GC)
#  mean time:        1.235 s (0.00% GC)
#  maximum time:     1.239 s (0.00% GC)
#  --------------
#  samples:          5
#  evals/sample:     1

I fixed the typo, thank you.

Thank you. However if I increase the dimensions of the problem, it becomes really slow. This will go inside a fixed point, so it must be done multiple times.

# dimensions
const da = 2         # actions
const dk = 5         # capital
const dx = 20
const dz = 25
const du = 25

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     72.789 s (0.00% GC)
  median time:      72.789 s (0.00% GC)
  mean time:        72.789 s (0.00% GC)
  maximum time:     72.789 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

For what it’s worth, you may want to consider as well using a package like Tullio (with LoopVectorization) to write these kinds of expectations for you. It makes things a lot easier, and it gives you faster code:

using Tullio, LoopVectorization

function compute_EV2!(EV, V, Fk, Fx, Fz, Fu)
    @tullio EV[a, k, x, z, u] = begin
        Fk[a][k,ik] * Fx[x, ix] * Fz[z, iz] * Fu[u, iu] * V[ik, ix, iz, iu]
    end
end

@benchmark compute_EV2!($EV, $V, $Fk, $Fx, $Fz, $Fu)
# BenchmarkTools.Trial: 
#   memory estimate:  0 bytes
#   allocs estimate:  0
#   --------------
#   minimum time:     338.559 ms (0.00% GC)
#   median time:      340.697 ms (0.00% GC)
#   mean time:        343.193 ms (0.00% GC)
#   maximum time:     351.591 ms (0.00% GC)
#   --------------
#   samples:          15
#   evals/sample:     1

That benchmark is single threaded, but Tullio will multithread the computation by default.

2 Likes

You have 9 nested loops — the runtime scales as the product of the sizes of all the loops. Any number to the 9th power gets big quickly!

If you want to get better scaling to large problems, you’ll need to fundamentally re-think your algorithm (or get a lot more computing power). e.g. maybe you can do some kind of Monte–Carlo sampling instead of looking at every combinatorial state every time.

4 Likes

One way to get around this issue (9 nested loops) is to compute each of the expectations iteratively. This works better for the same reason that chained matrix multiplication is more efficient when you compute each product sequentially, rather than doing it in a single nested loop. If all of the dimensions were the same size (call it n), then this way the algorithm is O(n^6) instead of O(n^9).

## Now do it the efficient way
function compute_EV3!(EV, EV_scratch, V, Fk, Fx, Fz, Fu)
    
    # Handle each expectation iteratively. First over u
    @tullio EV[a, k, x, z, u] = Fu[u, iu] * V[k, x, z, iu]
    EV_scratch .= EV

    # Now over z 
    @tullio EV[a,k,x, z,u] = Fz[z, iz] * EV_scratch[a, k, x, iz, u]
    EV_scratch .= EV

    # Now over x
    @tullio EV[a, k, x, z, u] = Fx[x, ix] * EV_scratch[a, k, ix, z, u]
    EV_scratch .= EV

    # Now over k
    @tullio EV[a, k, x, z, u] = Fk[a][k, ik] * EV_scratch[a, ik, x, z, u]

    return EV
end

EV3 = similar(EV)
EVS = similar(EV)
@benchmark compute_EV3!($EV3, $EVS, $V, $Fk, $Fx, $Fz, $Fu)

# BenchmarkTools.Trial: 
#   memory estimate:  1.69 KiB
#   allocs estimate:  36
#   --------------
#   minimum time:     935.304 μs (0.00% GC)
#   median time:      1.036 ms (0.00% GC)
#   mean time:        1.061 ms (0.00% GC)
#   maximum time:     1.865 ms (0.00% GC)
#   --------------
#   samples:          4709
#   evals/sample:     1

Now, when I blow up the problem to the larger size you were considering, the entire algorithm only takes about 2ms (with multithreading on)

3 Likes

The real problem here is that VFI is just slow. The best way to improve your performance here is to pick a different solution algorithm if you can.

But, conditional on VFI, definitely look into Tullio.jl for these matrix products. Precomputing might also help, for example I’m guessing Fx, Fz, and Fu are fixed throughout the solution, so you could precompute all the Fx * Fz * Fu terms. Finally, because it is just a sum reduction, you can multithread it if if is long enough to warrant the (small) overhead. Or maybe GPU?

1 Like

This should be a good starting point for the usual bag of tricks in Economics:

@incollection{maliar2014numerical,
  title={Numerical methods for large-scale dynamic economic models},
  author={Maliar, Lilia and Maliar, Serguei},
  booktitle={Handbook of computational economics},
  volume=3,
  pages={325--477},
  year=2014,
  publisher={Elsevier}
}

I am working on a library for Smolyak interpolations right now (there are several existing ones in Julia, my purposes is to be allocation-free for multithreaded code).

3 Likes

@Tamas_Papp This may not be what you want, but I have an existing implementation of adaptive Smolyak interpolation (following Brumm and Scheidegger 2017) which does not allocate for function evaluations. It’s currently registered as AdaptiveSparseGrids.jl.

3 Likes

Use that here! If you can of course. I’m still working under the assumption that x,z,u are exogenous variables since you didn’t say otherwise (your problem also says discrete states but I’m assuming it’s a discretized continuous process, perhaps I’m just assuming too much at this point…). Why not approximate their joint density over a sparse grid? Then you move from something like 25^3 points to… 100? I’ve not actually seen this done, but it was an idea I had a while back, just never had the chance/reason to implement it.

1 Like

I dunno, I’m not actually convinced that this would be appropriate for a problem this size. Even on the bigger version of their problem, computing the expectations on grids only takes 2 milliseconds.

Moving to sparse grids is a lot more costly than other standard interpolation techniques. It can pay off in higher dimensional problems, but it would not be my first choice for something like this, especially if they can get the method on grids to work.

1 Like

@tbeason, VFI is indeed slow, but I believe that any other method to solve a DP problem will require the expectation over the value function, right?

To precompute the matrices is good advice. Just need to take care with not running out of memory.

Right, you cannot get away from computing the expectation at all, but you can control how many times it needs to be computed.

Also, sorry @jacobadenbaum , I think I confused you for the OP with my last post haha

1 Like

You are right, I am discretizing the states. I will try solving it integrating over continuous states with sparse grids, thank you.

I cannot run your code. I am trying to run the following code but I get an UndefVarError. Am I doing something wrong?

using Random, Distributions, BenchmarkTools, Tullio, LoopVectorization

# dimensions
const da = 2         # actions
const dk = 5         # capital
const dx = 10
const dz = 15
const du = 15

# transition matrices
Fk = [rand(dk,dk) for a=1:da]
Fx = rand(dx,dx)
Fz = rand(dz,dz)
Fu = rand(du,du)

# normalization
Fk = [Fk[a]./= sum(Fk[a],dims=2) for a=1:da]
Fx ./= sum(Fx,dims=2)
Fz ./= sum(Fz,dims=2)
Fu ./= sum(Fu,dims=2)

# value functions
V = rand(dk,dx,dz,du)

# expectation of V conditional on current period action and states
EV = zeros(da,dk,dx,dz,du)

## Now do it the efficient way
function compute_EV3!(EV, EV_scratch, V, Fk, Fx, Fz, Fu)

    # Handle each expectation iteratively. First over u
    @tullio EV[a, k, x, z, u] = Fu[u, iu] * V[k, x, z, iu]
    EV_scratch .= EV

    # Now over z
    @tullio EV[a,k,x, z,u] = Fz[z, iz] * EV_scratch[a, k, x, iz, u]
    EV_scratch .= EV

    # Now over x
    @tullio EV[a, k, x, z, u] = Fx[x, ix] * EV_scratch[a, k, ix, z, u]
    EV_scratch .= EV

    # Now over k
    @tullio EV[a, k, x, z, u] = Fk[a][k, ik] * EV_scratch[a, ik, x, z, u]

    return EV
end

EV3 = similar(EV)
EVS = similar(EV)
compute_EV3!(EV3,EVS,V,Fk,Fx,Fz,Fu)

Unfortunately, I get this when I try to run compute_EV3!(EV3,EVS,V,Fk,Fx,Fz,Fu).

UndefVarError: ####op#313_0 not defined
in top-level scope at dp2.jl:51
in compute_EV3! at dp2.jl:32
in macro expansion at Tullio\PhxRJ\src\macro.jl:1001 
in threader at Tullio\PhxRJ\src\threads.jl:63 
in thread_halves at Tullio\PhxRJ\src\threads.jl:125
in thread_halves at Tullio\PhxRJ\src\threads.jl:128
in tile_halves at Tullio\PhxRJ\src\threads.jl:136 
in tile_halves at Tullio\PhxRJ\src\threads.jl:139
in 𝒜𝒸𝓉! at Tullio\PhxRJ\src\macro.jl:1090 
in _avx_! at LoopVectorization\8MeJu\src\reconstruct_loopset.jl:526
in macro expansion at LoopVectorization\8MeJu\src\reconstruct_loopset.jl:526 

Oh yeah that seems to be a bug with LoopVectorization. I’m not sure why, but try running it without loading LoopVectorization (or by passing the argument avx = false, i.e, @tullio avx = false EV[a,k, x, z, u ] = ...)

1 Like

One more note on this: even if you switch over to a continuous integration on the sparse grids (and be careful here, your mileage may vary), you may still want to experiment with computing the expectations over each shock iteratively. The integrand here is still fundamentally 9-dimensional, and it may be a lot less costly to approximate a 6-dimensional integrand 4 times than to attack the big one directly. You’ll have to experiment to see which one gives you better performance in your actual code.

3 Likes

I fixed the LoopVectorization bug, but instead of this:

@tullio EV[a, k, x, z, u] = Fu[u, iu] * V[k, x, z, iu]

maybe you should do this:

EVview = @view(EV[begin,:,:,:,:])
@tullio EVview[k, x, z, u] = Fu[u, iu] * V[k, x, z, iu]
@inbounds for I in CartesianIndices(Evview)
    ev = EVview[I]
    for a in axes(EV_scratch, 1)
        EV_scratch[a,I] = ev
    end
end

Two sets of 5 loops is O(N^5), which should be much faster than O(N^6).
This would probably be even better:

evview_size = Base.tail(size(EV))
EVview = reshape(view(vec(EV), 1:prod(evview_size)), evview_size)

The idea above is just to make accesses to the first dimension contiguous.

7 Likes