Take a look at this example:
You don’t necessarily need to use the package, but the pattern for chunking is shown there. The docs of the package also discuss other patterns.
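For reference, the basic idea behind the chunking can be sketched without the package, using Iterators.partition from the standard library (the values of n and nchunks here are arbitrary):

```julia
n = 10
nchunks = 3
# Split 1:n into at most nchunks consecutive ranges;
# cld rounds the per-chunk length up so all indices are covered.
chunk_ranges = Iterators.partition(1:n, cld(n, nchunks))
```

Here collect(chunk_ranges) gives [1:4, 5:8, 9:10]: three independent ranges that can each be handed to one task.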
PS: if you store independent matrices for each simulation and can afford a copy of all the data for each simulation, you can just put @threads
in front of the loop over the simulations. If, however, you want to create only on the order of nthreads()
copies of the auxiliary buffers, you need to split the workload into chunks, as done there.
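The simple case (no shared buffers at all) would look like this minimal sketch, where simulate is a hypothetical stand-in for a simulation that returns an independent result:

```julia
using Base.Threads

# Hypothetical stand-in: returns an independent 2x2 result per simulation
simulate(p, i) = fill(p * i, 2, 2)

n_simulations = 10
out = Vector{Matrix{Float64}}(undef, n_simulations)
@threads for i in 1:n_simulations
    out[i] = simulate(2.0, i) # each iteration writes only to its own slot
end
```

Since each iteration touches only its own slot of out, no chunking is needed here.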
Using the package above, your code could be something like:
using ChunkSplitters
using Base.Threads

n_simulations = 1000
nchunks = Threads.nthreads() # can vary
out = [zeros(2, 2) for _ in 1:n_simulations]
work_vector = [zeros(2) for _ in 1:nchunks]
for p in parameters
    @threads for (i_range, i_chunk) in chunks(1:n_simulations, nchunks)
        for i in i_range
            monte_carlo_simulation!(p, work_vector[i_chunk], out[i]) # in-place changes to out[i]
        end
    end
    save_data(out)
end
This pattern can suffer from the (possibly unimportant) false-sharing issue mentioned below, but that is a performance issue, not a correctness one, and may be irrelevant for you.
Alternatively, if the output matrices are used in hot loops within the simulation (which is what can cause the false-sharing issue), you can create the output matrix for each simulation inside the for i in i_range
loop, and push or copy the values into the array of output matrices only at the end of each simulation.
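That alternative can be sketched as follows; the assignment local_out .= i is a hypothetical placeholder for the in-place work of the simulation:

```julia
using Base.Threads

n_simulations = 8
out = Vector{Matrix{Float64}}(undef, n_simulations)
@threads for i in 1:n_simulations
    # Allocate the output inside the loop body: each task only touches
    # memory it allocated itself, so the hot loop cannot false-share.
    local_out = zeros(2, 2)
    local_out .= i       # hypothetical in-place simulation work
    out[i] = local_out   # store the result only at the end
end
```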
This is a working minimal example:
using ChunkSplitters
using Base.Threads
function simulation!(output, input, buffer)
    s = 0.0 # expensive stuff, just to take time
    for i in 2:10^3
        s += log(i)^(1/13)
    end
    buffer[1] = input
    output[1] = s * input * buffer[1]
end
function many_simulations2(n; nchunks = Threads.nthreads())
    inputs = 1:n
    outputs = [zeros(1) for _ in 1:n]
    buffer = [zeros(1) for _ in 1:nchunks]
    @threads for (sim_range, i_chunk) in chunks(1:n, nchunks)
        for isim in sim_range
            simulation!(outputs[isim], inputs[isim], buffer[i_chunk])
        end
    end
    return outputs
end
This is a variation of the code above, using channels to store the buffers:
#
# Using channels to store buffers
#
function many_simulations(n; nchunks = Threads.nthreads())
    inputs = 1:n
    outputs = [zeros(1) for _ in 1:n]
    buffers = Channel{Vector{Float64}}(nchunks)
    for _ in 1:nchunks
        put!(buffers, zeros(1))
    end
    @sync for (sim_range, _) in chunks(1:n, nchunks)
        @spawn begin
            buffer = take!(buffers)
            for isim in sim_range
                simulation!(outputs[isim], inputs[isim], buffer)
            end
            put!(buffers, buffer)
        end
    end
    return outputs
end
I also think that this (simpler) pattern solves any possible false sharing associated with the buffers: it creates the buffers independently for each chunk of work, locally inside the spawned task.
#
# Allocate the buffer for each chunk: avoids false sharing
#
using ChunkSplitters, Base.Threads
function many_simulations3(n; nchunks = Threads.nthreads())
    inputs = 1:n
    outputs = [zeros(1) for _ in 1:n]
    @sync for (sim_range, _) in chunks(1:n, nchunks)
        @spawn begin
            local buffer = zeros(1) # note: this "local" is redundant
            for isim in sim_range
                simulation!(outputs[isim], inputs[isim], buffer)
            end
        end
    end
    return outputs
end
(I don’t see any performance difference among these functions in this MWE.)