Parallelizing a Nested Loop Concurrently

jmcastro2109 · January 28, 2022, 11:56pm

Hi! I am trying to code a problem in which I have two loops. An inner loop that a computes a sample average for a given agent using many many simulations of a particular function (which is not trivial, but each individual function is not excruciatingly slow either, it takes like one second) and in the outer loop I loop across different agents.

What is the best practice for paralellizing both the inner loop and the outer loop at the same time. Say, for a given agent, computing several simulations in parallel and at the same time computing the value for different agents in parallel.

using Random

const N = 1000
const S = 100000000

Data = rand(S,N)


function slow_inner_loop(data_n)  # This function computes a sample average through Monte Carlo simulations

    res_slow = zeros(S)

    for s = 1 : S
        res_slow[s] = data_n[s]  # Here I have a function that computes a value for a given Monte Carlo Draw 
    end    

    return sum(res_slow)/S

end    


function outer_loop(Data)  # This function computes the sample average for N different "agents" and then computes a function with the computed averages

    res_outer = zeros(N)

    for n = 1 : N
        res_outer[n] = slow_inner_loop(Data[:,n])
    end    

    return sum(res_outer)/N

end    

outer_loop(Data)

lmiq · January 29, 2022, 12:15am

My shot at it can be, leave it to the masters (in this case, FLoops). I played with it now, and this is what I get here:

using Random
using FLoops

function mc()
    sleep(0.01)
    return 0.5
end

function slow_inner_loop(data_n)  # This function computes a sample average through Monte Carlo simulations
    S = length(data_n)
    res_slow = zeros(S)
    for s = 1 : S
        res_slow[s] = mc()
    end    
    return sum(res_slow)/S
end    

function slow_inner_loop_floops(data_n)  # This function computes a sample average through Monte Carlo simulations
    S = length(data_n)
    @floop for s = 1 : S
        mc_s = mc()
        @reduce( res_slow = zero(eltype(data_n)) + mc_s )
    end    
    return sum(res_slow)/S
end    


function outer_loop(Data)  # This function computes the sample average for N different "agents" and then computes a function with the computed averages
    N = size(Data,2)
    res_outer = zeros(N)
    for n = 1 : N
        res_outer[n] = slow_inner_loop(Data[:,n])
    end    
    return sum(res_outer)/N
end    

function outer_loop_floops(Data)  # This function computes the sample average for N different "agents" and then computes a function with the computed averages
    N = size(Data,2)
    @floop for n = 1 : N
        r_inner = slow_inner_loop(Data[:,n])
        @reduce(res_outer = zero(eltype(Data)) + r_inner)
    end    
    return sum(res_outer)/N
end    

function test(;N=10,S=100)
    data = rand(S,N)

    println("Serial")
    outer_loop(rand(10,5))
    @time (r1 = outer_loop(data))

    println("FLoops:")
    outer_loop_floops(rand(10,5))
    @time (r2 = outer_loop_floops(data))

    r1 ≈ r2
end

With that, I get:

julia> test()
Serial
 11.316991 seconds (5.09 k allocations: 172.938 KiB)
FLoops:
  1.172280 seconds (5.15 k allocations: 175.453 KiB)
true

Which means that @floop took the total time from 11 seconds to 1.17 seconds (here with 8 threads). I guess it is not only the parallelization which the macro is improving here, but the result is great overall, and probably the nested parallelization here is as good as it can be.

jmcastro2109 · January 29, 2022, 12:26am

Hi @lmiq this is amazing. Just a quick follow up.

Assume that in outer_loop_floops() I actually want to store the value for each n. What would be the proper way to do this?

PetrKryslUCSD · January 29, 2022, 12:29am

How did the code produce better than ideal speedup?

tkf · January 29, 2022, 12:37am

FYI, you can also write it as Folds.sum(f, Data) / S / N where f is the function you’d use on data_n[s]. Or, if you really want to do it as a nested loop (e.g., S is different for each inner loop) you can do Folds.sum(Folds.sum(f, @view Data[:, n]) / S for n in axes(Data, 2)) / N.

See also:

lmiq · January 29, 2022, 12:43am

I guess you need to preallocate an array to store that data, and annotate it there. Nothing special (there is no concurrency there, as far as I can see).

jmcastro2109 · January 29, 2022, 12:52am

But with SharedArrays?

jmcastro2109 · January 29, 2022, 12:53am

With the Folds. would it be running in parallel? S is always the same across n’s. Which package should I upload to use this?

lmiq · January 29, 2022, 1:01am

I don’t know, but the low hanging fruit in the serial code is to use a view for Data[:,n], and the macro may be doing something smart with it.

lmiq · January 29, 2022, 1:02am

I’ve no experience with these.

tkf · January 29, 2022, 2:57am

Yes, Folds functions are all parallel (by default). Folds.jl is here: https://github.com/JuliaFolds/Folds.jl. To start parallel programming in Julia, see: A quick introduction to data parallelism in Julia

jmcastro2109 · January 29, 2022, 3:24pm

Should one option be faster than the other?

tkf · January 29, 2022, 8:18pm

What are you comparing?

jmcastro2109 · January 29, 2022, 8:33pm

FLoops vrs Folds

tkf · January 29, 2022, 8:37pm

There should be typically no performance difference. But, if something can be done with Folds, I’d recommend Folds. FLoops is more flexible so you can mess things up easily but it also means that you can hand-optimize things.

jmcastro2109 · February 1, 2022, 1:58pm

Is there a way to use pmap in the inner loop and @floops in the outer loop?

Or to do this with Distributed instead of multithreading?

Providing a little bit of context, the current solution based on FLoops spends a lot of time in Garbage Collection and uses a lot of memory. In this post, @odow provides a version based on Distributed which has comparable speed but uses much less memory.

jmcastro2109 · February 1, 2022, 5:08pm

@lmiq

lmiq · February 1, 2022, 5:23pm

I think you need to follow the directions that where given in the other thread, that is, you should not create a model at every iteration of the loop. I am not that familiar with JuMP to know how exactly to do that, but I can imagine that one alternative is to create a model for each thread, and update the data/iniital values of that model in hot loops inside each thread.

Maybe a pattern like

for threads in 1:Threads.nthreads()
    model = JuMP.model(...)
    for trail in 1:ntrials
         # update model parameters for this thread
    end
end

Topic		Replies	Views
Multi-threading for nested loops with inner loop collection depending on outer loop Performance parallel , multithreading , io	7	707	April 18, 2023
FLoops composable way of multithreading nested loops? Performance parallel , multithreading	2	856	September 28, 2021
Parallelize nested loop in v1.72 General Usage performance , parallel	5	721	August 11, 2022
Nested Loop optimization Performance	2	427	September 20, 2022
Parallel loop New to Julia	1	416	October 7, 2019

Parallelizing a Nested Loop Concurrently

Related topics