Hi! I am trying to code a problem in which I have two loops. An inner loop that a computes a sample average for a given agent using many many simulations of a particular function (which is not trivial, but each individual function is not excruciatingly slow either, it takes like one second) and in the outer loop I loop across different agents.

What is the best practice for paralellizing both the inner loop and the outer loop at the same time. Say, for a given agent, computing several simulations in parallel and at the same time computing the value for different agents in parallel.

using Random
const N = 1000
const S = 100000000
Data = rand(S,N)
function slow_inner_loop(data_n) # This function computes a sample average through Monte Carlo simulations
res_slow = zeros(S)
for s = 1 : S
res_slow[s] = data_n[s] # Here I have a function that computes a value for a given Monte Carlo Draw
end
return sum(res_slow)/S
end
function outer_loop(Data) # This function computes the sample average for N different "agents" and then computes a function with the computed averages
res_outer = zeros(N)
for n = 1 : N
res_outer[n] = slow_inner_loop(Data[:,n])
end
return sum(res_outer)/N
end
outer_loop(Data)

My shot at it can be, leave it to the masters (in this case, FLoops). I played with it now, and this is what I get here:

using Random
using FLoops
function mc()
sleep(0.01)
return 0.5
end
function slow_inner_loop(data_n) # This function computes a sample average through Monte Carlo simulations
S = length(data_n)
res_slow = zeros(S)
for s = 1 : S
res_slow[s] = mc()
end
return sum(res_slow)/S
end
function slow_inner_loop_floops(data_n) # This function computes a sample average through Monte Carlo simulations
S = length(data_n)
@floop for s = 1 : S
mc_s = mc()
@reduce( res_slow = zero(eltype(data_n)) + mc_s )
end
return sum(res_slow)/S
end
function outer_loop(Data) # This function computes the sample average for N different "agents" and then computes a function with the computed averages
N = size(Data,2)
res_outer = zeros(N)
for n = 1 : N
res_outer[n] = slow_inner_loop(Data[:,n])
end
return sum(res_outer)/N
end
function outer_loop_floops(Data) # This function computes the sample average for N different "agents" and then computes a function with the computed averages
N = size(Data,2)
@floop for n = 1 : N
r_inner = slow_inner_loop(Data[:,n])
@reduce(res_outer = zero(eltype(Data)) + r_inner)
end
return sum(res_outer)/N
end
function test(;N=10,S=100)
data = rand(S,N)
println("Serial")
outer_loop(rand(10,5))
@time (r1 = outer_loop(data))
println("FLoops:")
outer_loop_floops(rand(10,5))
@time (r2 = outer_loop_floops(data))
r1 ≈ r2
end

With that, I get:

julia> test()
Serial
11.316991 seconds (5.09 k allocations: 172.938 KiB)
FLoops:
1.172280 seconds (5.15 k allocations: 175.453 KiB)
true

Which means that @floop took the total time from 11 seconds to 1.17 seconds (here with 8 threads). I guess it is not only the parallelization which the macro is improving here, but the result is great overall, and probably the nested parallelization here is as good as it can be.

FYI, you can also write it as Folds.sum(f, Data) / S / N where f is the function you’d use on data_n[s]. Or, if you really want to do it as a nested loop (e.g., S is different for each inner loop) you can do Folds.sum(Folds.sum(f, @view Data[:, n]) / S for n in axes(Data, 2)) / N.

I guess you need to preallocate an array to store that data, and annotate it there. Nothing special (there is no concurrency there, as far as I can see).

There should be typically no performance difference. But, if something can be done with Folds, I’d recommend Folds. FLoops is more flexible so you can mess things up easily but it also means that you can hand-optimize things.

Is there a way to use pmap in the inner loop and @floops in the outer loop?

Or to do this with Distributed instead of multithreading?

Providing a little bit of context, the current solution based on FLoops spends a lot of time in Garbage Collection and uses a lot of memory. In this post, @odow provides a version based on Distributed which has comparable speed but uses much less memory.

I think you need to follow the directions that where given in the other thread, that is, you should not create a model at every iteration of the loop. I am not that familiar with JuMP to know how exactly to do that, but I can imagine that one alternative is to create a model for each thread, and update the data/iniital values of that model in hot loops inside each thread.

Maybe a pattern like

for threads in 1:Threads.nthreads()
model = JuMP.model(...)
for trail in 1:ntrials
# update model parameters for this thread
end
end