Parallelizing a Nested Loop Concurrently

Hi! I am trying to code a problem in which I have two loops. An inner loop that a computes a sample average for a given agent using many many simulations of a particular function (which is not trivial, but each individual function is not excruciatingly slow either, it takes like one second) and in the outer loop I loop across different agents.

What is the best practice for paralellizing both the inner loop and the outer loop at the same time. Say, for a given agent, computing several simulations in parallel and at the same time computing the value for different agents in parallel.

using Random

const N = 1000
const S = 100000000

Data = rand(S,N)


function slow_inner_loop(data_n)  # This function computes a sample average through Monte Carlo simulations

    res_slow = zeros(S)

    for s = 1 : S
        res_slow[s] = data_n[s]  # Here I have a function that computes a value for a given Monte Carlo Draw 
    end    

    return sum(res_slow)/S

end    


function outer_loop(Data)  # This function computes the sample average for N different "agents" and then computes a function with the computed averages

    res_outer = zeros(N)

    for n = 1 : N
        res_outer[n] = slow_inner_loop(Data[:,n])
    end    

    return sum(res_outer)/N

end    

outer_loop(Data)

My shot at it can be, leave it to the masters (in this case, FLoops). I played with it now, and this is what I get here:

using Random
using FLoops

function mc()
    sleep(0.01)
    return 0.5
end

function slow_inner_loop(data_n)  # This function computes a sample average through Monte Carlo simulations
    S = length(data_n)
    res_slow = zeros(S)
    for s = 1 : S
        res_slow[s] = mc()
    end    
    return sum(res_slow)/S
end    

function slow_inner_loop_floops(data_n)  # This function computes a sample average through Monte Carlo simulations
    S = length(data_n)
    @floop for s = 1 : S
        mc_s = mc()
        @reduce( res_slow = zero(eltype(data_n)) + mc_s )
    end    
    return sum(res_slow)/S
end    


function outer_loop(Data)  # This function computes the sample average for N different "agents" and then computes a function with the computed averages
    N = size(Data,2)
    res_outer = zeros(N)
    for n = 1 : N
        res_outer[n] = slow_inner_loop(Data[:,n])
    end    
    return sum(res_outer)/N
end    

function outer_loop_floops(Data)  # This function computes the sample average for N different "agents" and then computes a function with the computed averages
    N = size(Data,2)
    @floop for n = 1 : N
        r_inner = slow_inner_loop(Data[:,n])
        @reduce(res_outer = zero(eltype(Data)) + r_inner)
    end    
    return sum(res_outer)/N
end    

function test(;N=10,S=100)
    data = rand(S,N)

    println("Serial")
    outer_loop(rand(10,5))
    @time (r1 = outer_loop(data))

    println("FLoops:")
    outer_loop_floops(rand(10,5))
    @time (r2 = outer_loop_floops(data))

    r1 ≈ r2
end
 

With that, I get:

julia> test()
Serial
 11.316991 seconds (5.09 k allocations: 172.938 KiB)
FLoops:
  1.172280 seconds (5.15 k allocations: 175.453 KiB)
true

Which means that @floop took the total time from 11 seconds to 1.17 seconds (here with 8 threads). I guess it is not only the parallelization which the macro is improving here, but the result is great overall, and probably the nested parallelization here is as good as it can be.

3 Likes

Hi @lmiq this is amazing. Just a quick follow up.

Assume that in outer_loop_floops() I actually want to store the value for each n. What would be the proper way to do this?

How did the code produce better than ideal speedup?

FYI, you can also write it as Folds.sum(f, Data) / S / N where f is the function you’d use on data_n[s]. Or, if you really want to do it as a nested loop (e.g., S is different for each inner loop) you can do Folds.sum(Folds.sum(f, @view Data[:, n]) / S for n in axes(Data, 2)) / N.

See also:

2 Likes

I guess you need to preallocate an array to store that data, and annotate it there. Nothing special (there is no concurrency there, as far as I can see).

But with SharedArrays?

With the Folds. would it be running in parallel? S is always the same across n's. Which package should I upload to use this?

I don’t know, but the low hanging fruit in the serial code is to use a view for Data[:,n], and the macro may be doing something smart with it.

I’ve no experience with these.

Yes, Folds functions are all parallel (by default). Folds.jl is here: https://github.com/JuliaFolds/Folds.jl. To start parallel programming in Julia, see: A quick introduction to data parallelism in Julia

2 Likes

Should one option be faster than the other?

What are you comparing?

FLoops vrs Folds

There should be typically no performance difference. But, if something can be done with Folds, I’d recommend Folds. FLoops is more flexible so you can mess things up easily but it also means that you can hand-optimize things.

3 Likes

Is there a way to use pmap in the inner loop and @floops in the outer loop?

Or to do this with Distributed instead of multithreading?

Providing a little bit of context, the current solution based on FLoops spends a lot of time in Garbage Collection and uses a lot of memory. In this post, @odow provides a version based on Distributed which has comparable speed but uses much less memory.

@lmiq

I think you need to follow the directions that where given in the other thread, that is, you should not create a model at every iteration of the loop. I am not that familiar with JuMP to know how exactly to do that, but I can imagine that one alternative is to create a model for each thread, and update the data/iniital values of that model in hot loops inside each thread.

Maybe a pattern like

for threads in 1:Threads.nthreads()
    model = JuMP.model(...)
    for trail in 1:ntrials
         # update model parameters for this thread
    end
end
1 Like