# Parallelizing a Nested Loop Concurrently

Hi! I am trying to code a problem with two loops: an inner loop that computes a sample average for a given agent using many simulations of a particular function (the function is not trivial, but each individual call is not excruciatingly slow either; it takes about one second), and an outer loop that iterates across different agents.

What is the best practice for parallelizing both the inner loop and the outer loop at the same time? Say, for a given agent, computing several simulations in parallel while also computing the values for different agents in parallel.

```julia
using Random

const N = 1000          # number of agents
const S = 100_000_000   # number of Monte Carlo draws per agent

Data = rand(S, N)

# Computes a sample average through Monte Carlo simulations
function slow_inner_loop(data_n)
    res_slow = zeros(S)
    for s in 1:S
        # Here I have a function that computes a value for a given Monte Carlo draw
        res_slow[s] = data_n[s]
    end
    return sum(res_slow) / S
end

# Computes the sample average for N different "agents" and then a function of the computed averages
function outer_loop(Data)
    res_outer = zeros(N)
    for n in 1:N
        res_outer[n] = slow_inner_loop(Data[:, n])
    end
    return sum(res_outer) / N
end

outer_loop(Data)
```

My shot at it: leave it to the masters (in this case, FLoops). I played with it just now, and this is what I get:

```julia
using Random
using FLoops

# Stand-in for the expensive per-draw computation
function mc()
    sleep(0.01)
    return 0.5
end

# Computes a sample average through Monte Carlo simulations (serial)
function slow_inner_loop(data_n)
    S = length(data_n)
    res_slow = zeros(S)
    for s in 1:S
        res_slow[s] = mc()
    end
    return sum(res_slow) / S
end

# Same sample average, with the inner loop parallelized by FLoops
function slow_inner_loop_floops(data_n)
    S = length(data_n)
    @floop for s in 1:S
        mc_s = mc()
        @reduce(res_slow = zero(eltype(data_n)) + mc_s)
    end
    return res_slow / S  # res_slow is already a scalar after the reduction
end

# Computes the sample average for N different "agents" (serial outer loop)
function outer_loop(Data)
    N = size(Data, 2)
    res_outer = zeros(N)
    for n in 1:N
        res_outer[n] = slow_inner_loop(Data[:, n])
    end
    return sum(res_outer) / N
end

# Outer loop parallelized by FLoops
function outer_loop_floops(Data)
    N = size(Data, 2)
    @floop for n in 1:N
        r_inner = slow_inner_loop(Data[:, n])
        @reduce(res_outer = zero(eltype(Data)) + r_inner)
    end
    return res_outer / N  # res_outer is already a scalar after the reduction
end

function test(; N = 10, S = 100)
    data = rand(S, N)

    println("Serial")
    outer_loop(rand(10, 5))        # warm up compilation
    @time (r1 = outer_loop(data))

    println("FLoops:")
    outer_loop_floops(rand(10, 5)) # warm up compilation
    @time (r2 = outer_loop_floops(data))

    r1 ≈ r2
end
```

With that, I get:

```
julia> test()
Serial
 11.316991 seconds (5.09 k allocations: 172.938 KiB)
FLoops:
  1.172280 seconds (5.15 k allocations: 175.453 KiB)
true
```

Which means that `@floop` took the total time from 11 seconds down to 1.17 seconds (here with 8 threads). I guess it is not only the parallelization that the macro is improving here, but the result is great overall, and the nested parallelization is probably as good as it can get.


Hi @lmiq, this is amazing. Just a quick follow-up.

Assume that in `outer_loop_floops()` I actually want to store the value for each `n`. What would be the proper way to do this?

How did the code produce better than ideal speedup?

FYI, you can also write it as `Folds.sum(f, Data) / S / N` where `f` is the function you’d use on `data_n[s]`. Or, if you really want to do it as a nested loop (e.g., `S` is different for each inner loop) you can do `Folds.sum(Folds.sum(f, @view Data[:, n]) / S for n in axes(Data, 2)) / N`.
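To make that concrete, here is a minimal sketch of both `Folds` variants. The names `f`, `grand_average`, and `grand_average_nested` are hypothetical, and `f` is just a stand-in for the real per-draw computation:

```julia
using Folds  # third-party: https://github.com/JuliaFolds/Folds.jl

# Stand-in for the per-draw computation; replace with your real function.
f(x) = x

# Flat parallel reduction over all S*N entries at once.
function grand_average(Data)
    S, N = size(Data)
    return Folds.sum(f, Data) / S / N
end

# Nested variant, useful if the inner work differs per agent.
function grand_average_nested(Data)
    S, N = size(Data)
    return Folds.sum(Folds.sum(f, @view Data[:, n]) / S for n in axes(Data, 2)) / N
end
```

Both compute the same grand average as the original nested loops; the flat version gives the scheduler the most freedom to balance work.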


I guess you need to preallocate an array to store that data, and assign into it inside the loop. Nothing special (there is no race condition there, as far as I can see, since each iteration writes to a distinct index).
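A minimal sketch of that idea (hypothetical names; the inner computation is just a column mean as a stand-in): preallocate `res_outer`, write into it inside the `@floop`, and keep the reduction for the overall average:

```julia
using FLoops

function outer_loop_floops_store(Data)
    N = size(Data, 2)
    res_outer = zeros(N)  # preallocated storage, one slot per agent
    @floop for n in 1:N
        r_inner = sum(@view Data[:, n]) / size(Data, 1)  # stand-in for the inner loop
        res_outer[n] = r_inner       # distinct index per iteration: no data race
        @reduce(total = 0.0 + r_inner)
    end
    return total / N, res_outer
end
```

The key point is that writing to `res_outer[n]` is safe because no two iterations share an index.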

But with SharedArrays?

With `Folds.`, would it be running in parallel? `S` is always the same across `n`'s. Which package should I install to use this?

I don’t know, but the low-hanging fruit in the serial code is to use a view for `Data[:,n]`, and the macro may be doing something smart with it.

I’ve no experience with these.

Yes, `Folds` functions are all parallel (by default). Folds.jl is here: https://github.com/JuliaFolds/Folds.jl. To start parallel programming in Julia, see: A quick introduction to data parallelism in Julia


Should one option be faster than the other?

What are you comparing?

FLoops vs. Folds

There should typically be no performance difference. But if something can be done with Folds, I’d recommend Folds. FLoops is more flexible, so you can mess things up more easily, but it also means that you can hand-optimize things.
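For a concrete sense of the trade-off, here is the same reduction written both ways (a sketch; `f` is a hypothetical per-element function):

```julia
using FLoops, Folds

f(x) = 2x  # hypothetical per-element function

# Folds: one call, little room to get the reduction wrong
sum_folds(xs) = Folds.sum(f, xs)

# FLoops: more verbose, but the loop body is yours to customize
function sum_floops(xs)
    @floop for x in xs
        y = f(x)
        @reduce(s = 0 + y)
    end
    return s
end
```

Both return the same value, e.g. `sum_folds(1:10) == sum_floops(1:10)`.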


Is there a way to use `pmap` in the inner loop and `@floops` in the outer loop?

Or to do this with `Distributed` instead of multithreading?

Providing a little bit of context, the current solution based on FLoops spends a lot of time in Garbage Collection and uses a lot of memory. In this post, @odow provides a version based on `Distributed` which has comparable speed but uses much less memory.
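For reference, a minimal sketch of the `Distributed` route (hypothetical names; the `addprocs` count is an assumption, and `agent_average` stands in for the real per-agent computation). `pmap` distributes one task per agent across worker processes:

```julia
using Distributed
addprocs(4)  # assumption: 4 worker processes; adjust to your machine

# Define the per-agent work on all workers
@everywhere function agent_average(data_n)
    # place the real Monte Carlo computation here
    return sum(data_n) / length(data_n)
end

function outer_loop_pmap(Data)
    cols = (Data[:, n] for n in 1:size(Data, 2))
    res = pmap(agent_average, cols)  # one distributed task per agent
    return sum(res) / length(res)
end
```

Because each worker is a separate process with its own heap, per-agent allocations do not contend with each other or with the main process's garbage collector, though each column is serialized to its worker.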

I think you need to follow the directions that were given in the other thread, that is, you should not create a model at every iteration of the loop. I am not familiar enough with JuMP to know exactly how to do that, but I can imagine that one alternative is to create a model for each thread, and update the data/initial values of that model in hot loops inside each thread.

Maybe a pattern like

```julia
Threads.@threads for thread in 1:Threads.nthreads()
    model = JuMP.Model(...)  # build one model per thread
    for trial in 1:ntrials
        # update model parameters and re-solve for this trial
    end
end
```