FLoops: a composable way of multithreading nested loops?

I’m writing an algorithm for a discrete-time dynamic programming problem in economics. The algorithm involves nested loops over 4 state variables: 2 exogenous and 2 endogenous. The solution to the problem is a 4-dimensional array, with each dimension corresponding to one of the state variables.

I would like to multithread these nested loops if I can. From reading here, FLoops.jl seems to be an ideal way of doing this. However, there is a complication because the outer two loops need to wait until the inner two loops have completed before they can move on to their next iteration.

Below I’ve set out a toy example of what I need to do:

for (iz, iβ) in Iterators.product(eachindex(zvals), eachindex(βvals))

    for (id, ia) in Iterators.product(eachindex(dgrid), eachindex(agrid))
        tmp[id, ia] = somecalcs(id, ia, iz, iβ)  # compute the solution
    end

    # the two steps below need the whole tmp array, so the inner loop must be finished here
    @views soln[:, :, iz, iβ] = curvinterp(tmp, amesh, dmesh)  # interpolate solutions onto rectilinear mesh

    @views soln[:, :, iz, iβ] = swapconstraints!(soln[:, :, iz, iβ], iz, iβ)  # swap in constrained solutions

end

Should I use FLoops.jl (e.g., @floop ThreadedEx()) on both the outer and inner loops?

I know that I can get an answer to this question by experimentation using @btime.

But I’m wondering if there’s a more theoretical answer rooted in how FLoops.jl and its different executor options work. Maybe I should use a different executor for the outer loop?

Greatly appreciate any advice anyone has.

-Patrick

If length(zvals) * length(βvals) is larger than the number of CPU cores you have and each iteration of the loop body has a similar workload, parallelizing only the outer for (iz, iβ) in ... loop should be fine. Otherwise, maybe you’d need

@floop ThreadedEx(basesize = 1) for iβ in eachindex(βvals), iz in eachindex(zvals)
    @floop ThreadedEx() for ia in eachindex(agrid), id in eachindex(dgrid)
...

This uses basesize = 1 on the outer loop so that each iteration gets its own task (i.e., the loop is maximally parallelized). The inner loop can also use ThreadedEx(basesize = max(1, Threads.nthreads() ÷ (length(zvals) * length(βvals)))) to reduce the number of tasks a bit.
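For reference, here is a minimal runnable sketch of the nested pattern, under some assumptions: somecalcs is a made-up stand-in for the real calculation, the grid sizes are arbitrary, solve_nested! is a hypothetical name, and the interpolation and constraint steps are reduced to a plain copy. An @floop loop only returns once all of its iterations are done, so the code after the inner loop always sees a fully filled tmp.

```julia
using FLoops

# Hypothetical stand-in for the per-point calculation in the question.
somecalcs(id, ia, iz, iβ) = Float64(id + 10 * ia + 100 * iz + 1000 * iβ)

function solve_nested!(soln, nd, na, nz, nβ)
    # Outer loop: basesize = 1 gives every (iz, iβ) pair its own task.
    @floop ThreadedEx(basesize = 1) for iβ in 1:nβ, iz in 1:nz
        # Allocate the buffer per outer iteration so tasks never share it.
        tmp = Matrix{Float64}(undef, nd, na)

        # Inner loop: all iterations have finished once control passes its `end`.
        @floop ThreadedEx() for ia in 1:na, id in 1:nd
            tmp[id, ia] = somecalcs(id, ia, iz, iβ)
        end

        # Placeholder for the interpolation / constraint steps (assumption:
        # they only read tmp and write the (iz, iβ) slice of soln).
        soln[:, :, iz, iβ] .= tmp
    end
    return soln
end

nd, na, nz, nβ = 5, 4, 2, 3
soln = solve_nested!(zeros(nd, na, nz, nβ), nd, na, nz, nβ)

# Compare against a plain sequential computation.
reference = [somecalcs(id, ia, iz, iβ) for id in 1:nd, ia in 1:na, iz in 1:nz, iβ in 1:nβ]
@assert soln == reference
```

Allocating tmp inside the outer loop body matters here: a single tmp shared across outer iterations would be overwritten concurrently by different tasks.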


Great! Thank you so much.