I am doing heterogeneous-agent macro in an OLG setting. I have a model where the agents have a lot of choices at each point in time, and it's becoming very slow to solve, especially when I increase the number of points in the state space. I am parallelizing the code using @threads. My question is: is there a gold standard for speeding up these kinds of problems in Julia? When you solve these problems, do you prefer to use the GPU?
To give a basis for discussion, the central point of my code (and the slowest part) is something like this:
T = 30
nb = 24
nh = 6
na = 3
nm = 21
ne = 7
vf = zeros(nb, nm, nh, na, ne, T) # I'm assuming the end-point is zero for every point in the state space, for simplicity
for ij = 1:T-1
    Threads.@threads for ib = 1:nb
        for im = 1:nm, ih = 1:nh, ia = 1:na, ie = 1:ne
            v, b, m, h = solve_problem(param, ib, im, ih, ia, ie, T-ij, vf[:, :, :, :, :, T-ij+1])
            vf[ib, im, ih, ia, ie, T-ij] = v
            # store decision variables
        end
    end
end
The first thing is to read Performance Tips · The Julia Language and apply the lessons there. Also, profiling is very important. Without profiling information, you are working in the dark. A GPU might be useful, but it depends very much on what solve_problem does.
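For example, something along these lines (run_model here is just a placeholder for your top-level backward-induction loop):

using Profile
@profile run_model()          # run the slow part once under the sampling profiler
Profile.print(mincount = 50)  # only show call sites that were sampled at least 50 times

That should tell you whether the time really goes into the spline construction, the spline evaluation, the optimizer, or somewhere unexpected like allocations.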
Thanks Kristoffer.
I have already ensured type stability, etc.
The profiler tells me that the cubic splines and the optimization problems take up most of the time, which is what I am doing in solve_problem. The only real way around this is to introduce better initial guesses for the optimization problem, in order to reduce the number of iterations the optimizer needs.
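Concretely, what I mean by better guesses is warm-starting each problem with the choices found one period ahead, something along these lines inside the loops (just a sketch with made-up names - policy, objective, lb, ub - and assuming something like Optim.jl):

using Optim

# policy has a leading dimension of length 3 holding the (b, m, h) choices at each state and age
x0 = policy[:, ib, im, ih, ia, ie, T - ij + 1]     # next period's optimum as the initial guess
res = optimize(x -> -objective(x, param, ib, im, ih, ia, ie),
               lb, ub, x0, Fminbox(LBFGS()))       # x0 must lie strictly inside the bounds
b_opt, m_opt, h_opt = Optim.minimizer(res)

Whether this actually cuts iterations presumably depends on how much the policies move across ages.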
The other way to speed this up would be to solve as many problems simultaneously as possible, which I am trying to do with @threads, but I am constrained by the number of cores in my computer. For each ij, I need to solve nb * nh * nm * na * ne = 63,504 problems. Currently, I am instructing the computer to solve nb problems in parallel. Could I instruct it to run more?
There are algorithms that are often more efficient than standard value function iteration. Which ones apply to your problem is hard to say without knowing more details than you want to share. But based on the heterogeneous-agent OLG label, I would think about endogenous grid points (see Lilia Maliar's work and https://www.sas.upenn.edu/~jesusfv/Endogenous_Grid.pdf); there are a few newer variations on this that I don't recall right now.
Thanks for your comment.
I will look into the endogenous grid method. However, this is a deterministic OLG model, so it is solved by backward induction rather than VFI, and I am not sure the method applies.
Even so, these methods usually entail filling out a value function VF(x,y,z,t), where x, y, z, t are state variables. For a large state space, i.e., where the grids of x, y, z, t are very fine, what is the best way to parallelize filling out the value function?
Do you use CPU or GPU parallel computing?
Is there an alternative way?
A better formulation using @threads?
I would pay attention to how you’re setting up your interpolation, and make sure you aren’t duplicating any work. In my example, I construct one interpolated value function over the assets grid for each x state. When looping over assets and solving the optimization problem, each problem references the same interpolant.
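Translated to your loop, the pattern inside the ij loop is roughly this (a sketch assuming Interpolations.jl, a range-valued b_grid for the bond grid, and a solve_problem that takes the interpolant instead of the raw value-function array; threading is omitted for clarity, and which interpolants you actually need depends on your model):

using Interpolations

vf_next = vf[:, :, :, :, :, T-ij+1]
for im = 1:nm, ih = 1:nh, ia = 1:na, ie = 1:ne
    # build ONE cubic spline over the bond grid per (im, ih, ia, ie) state...
    itp = cubic_spline_interpolation(b_grid, vf_next[:, im, ih, ia, ie])
    for ib = 1:nb
        # ...and reuse it for every optimization problem at that state
        v, b, m, h = solve_problem(param, ib, im, ih, ia, ie, T-ij, itp)
        vf[ib, im, ih, ia, ie, T-ij] = v
    end
end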
Thanks for your comment.
Yes indeed. That is a great way to save time. I got an order of magnitude speed-up from doing that in the past. I'll recheck the code to make sure that I'm being as efficient there as possible.
I’ll check your blog post.
I’ll echo the point about precomputing as much as you can and reusing memory whenever possible. You should absolutely parallelize this. Threads would probably be enough if one call to solve_problem doesn’t take too terribly long.
The other thing to look into is how to shrink the state space by using smarter grid construction (like sparse grids) and algorithmic improvements like EGM, as @hendri54 said. These are things that take more investment to get right, but that is ultimately where the performance lives.
Do you know if there is a way to do more operations simultaneously? Can I parallelize over individual points in the state space instead of over points in the grid?
I think the only thing you cannot parallelize over would be the time dimension. I see no reason why you could not solve the problem at every point in the state space separately within each time period, but perhaps there are models for which this is not true that I'm not aware of.
Do you have any sense of how far off you are from where you SHOULD be? When the state space blows up, you can quickly get into a situation where you simply have to do a lot of floating point operations, period. If that's the case, you might want to look at distributed rather than threads, and then it's off to the HPC.
I agree that I can, but how does one do this in Julia?
Can I write something like:
for ij = 1:T-1
    Threads.@threads for ib = 1:nb, im = 1:nm, ih = 1:nh, ia = 1:na, ie = 1:ne
        v, b, m, h = solve_problem(param, ib, im, ih, ia, ie, T-ij, vf[:, :, :, :, :, T-ij+1])
        vf[ib, im, ih, ia, ie, T-ij] = v
        # store decision variables
    end
end
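Or, if @threads does not accept several loop variables at once, would I have to flatten the state space into a single index myself, something like this?

states = CartesianIndices((nb, nm, nh, na, ne))
for ij = 1:T-1
    vf_next = vf[:, :, :, :, :, T-ij+1]          # slice once per period instead of once per state point
    Threads.@threads for k = 1:length(states)
        ib, im, ih, ia, ie = Tuple(states[k])    # recover the grid indices from the linear index
        v, b, m, h = solve_problem(param, ib, im, ih, ia, ie, T-ij, vf_next)
        vf[ib, im, ih, ia, ie, T-ij] = v
        # store decision variables
    end
end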
Thank you very much for the reference. It is very helpful.
A few questions:
(1) What is the advantage of using SharedArrays? I've experimented with them in the past, but came to the conclusion that using them in this application didn't lead to gains (when using @distributed instead of @threads). I think the reason is that I don't need to send information on the entire value function to each of the cores, but only bits of it.
(2) From what I understand, for a given t, your state space is of size nx * na * nb * nz, right? And you are doing nx simultaneous calculations, but not nx * na * nb * nz, correct?
Re (1), distributed workers don’t share memory, as they are different processes. So while threads can just all write into the same array, distributed processes need to write into a shared array. There’s of course some overhead to that, but I found in this application it was well amortised (I ran this on my university’s HPC back in the day with iirc 48 processes max)
Re (2), part of why it was worth it was that I did indeed have to do nx * na * nb * nz calculations - in this type of model, future expected income depends both on the state of the persistent AR(1) shock z and on the current belief about the idiosyncratic deterministic intercept and slope of the income profile.
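To make (1) concrete, the setup was roughly the following (a sketch with just two state dimensions for brevity; model.jl and params are placeholders for wherever your solve_problem and parameters live):

using Distributed
addprocs(4)                            # or however many workers your machine / HPC node gives you
@everywhere using SharedArrays
@everywhere include("model.jl")        # solve_problem and params must be defined on every worker

nb, nz, T = 24, 8, 30                  # example sizes
vf = SharedArray{Float64}(nb, nz, T)   # every worker writes directly into the same shared memory
for age = T-1:-1:1                     # backward induction stays serial
    @sync @distributed for ib = 1:nb   # the range 1:nb is what gets split across the workers
        for iz = 1:nz
            vf[ib, iz, age] = solve_problem(params, ib, iz, age, vf)
        end
    end
end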
Thanks for your reply. Let me try to explain my question. Let's say I do:
nb = 12
nz = 8
@sync @distributed for ib = 1:nb
    for z = 1:nz
        vf = solve_problem(params, ib, z, age)
    end
end
Let's say I have 48 cores on my computer and wish to solve this as fast as possible. In this case b can be bonds and z the AR(1) process for earnings. There is an outer loop over age, which is not shown.
Now I have to do 12 x 8 = 96 calculations. However, I can only do 12 simultaneously with this formulation, yet I have 48 cores. Is there an alternative formulation, with the same number of points, that allows me to use the full set of cores and increase speed (ignoring the memory cost for the moment)? I mean: solving simultaneously for 48 tuples (ib, z) instead of only twelve points ib?
A follow-up on this. The documentation on @distributed says:
A distributed memory, parallel for loop of the form:
@distributed [reducer] for var = range
    body
end
The specified range is partitioned and locally executed across all workers.
In light of what it says here, in the example that I wrote previously only 12 workers are executing simultaneously, since @distributed is applied to the range 1:nb and not to the nested loops. Is this interpretation correct? If so, is there a way to change the code so that it is the state space (nb, nz) that is partitioned and executed across all workers, instead of just the range 1:nb?
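To make sure I understand the fix: is the idea to collapse the two loops into a single range over all (ib, z) pairs, so that @distributed partitions all 96 problems across the 48 workers? Something like this (with vf as a SharedArray, as discussed above)?

states = CartesianIndices((nb, nz))
@sync @distributed for k = 1:length(states)   # 1:96 is what gets partitioned across the workers
    ib, iz = Tuple(states[k])                 # map the linear index back to (bond, shock) indices
    vf[ib, iz] = solve_problem(params, ib, iz, age)
end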