How to preallocate for a parallel Monte Carlo simulation?

Take a look at ChunkSplitters.jl. You don't necessarily need to use the package, but it shows the pattern for chunking, and its docs also discuss other patterns.

PS: if you will store independent matrices for each simulation and you can afford having a copy of all the data for each simulation, you can just put @threads in front of the loop over the simulations (see the sketch below). If, however, you want to create only on the order of nthreads() copies of the auxiliary buffers, you need to split the workload into chunks, as done there.
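For illustration, a minimal sketch of that simpler pattern; monte_carlo_simulation! and the parameter p are hypothetical stand-ins, as in the snippet further down:

using Base.Threads
n_simulations = 1000
out = [zeros(2, 2) for _ in 1:n_simulations]
work_vectors = [zeros(2) for _ in 1:n_simulations] # one buffer per simulation: a full copy each
@threads for i in 1:n_simulations
    monte_carlo_simulation!(p, work_vectors[i], out[i]) # hypothetical in-place simulation
end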

Using the package above, your code could be something like:

using ChunkSplitters
using Base.Threads
n_simulations = 1000
nchunks = Threads.nthreads() # number of chunks; can differ from the number of threads
out = [zeros(2, 2) for _ in 1:n_simulations]
work_vector = [zeros(2) for _ in 1:nchunks] # one auxiliary buffer per chunk
for p in parameters # `parameters`, `monte_carlo_simulation!`, and `save_data` are your own
    @threads for (i_range, i_chunk) in chunks(1:n_simulations, nchunks)
        for i in i_range
            monte_carlo_simulation!(p, work_vector[i_chunk], out[i]) # in-place changes to out[i]
        end
    end
    save_data(out)
end

This pattern can have the (maybe unimportant) false-sharing issue mentioned below: buffers used by different threads may sit on the same cache line, so concurrent writes force repeated cache synchronization. But that's a performance issue, not a correctness one, and may be irrelevant for you.

Alternatively, you could create the output matrix for each simulation inside the `for i in i_range` loop, and push or copy the values into the array of output matrices at the end of each simulation. That is worth doing when the output matrices are used within the simulation in hot loops, which is what may cause the false-sharing issue. A sketch of that variant follows.
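Continuing the snippet above (same hypothetical monte_carlo_simulation!), the loop would look like this:

@threads for (i_range, i_chunk) in chunks(1:n_simulations, nchunks)
    for i in i_range
        local_out = zeros(2, 2) # fresh output matrix, local to the task
        monte_carlo_simulation!(p, work_vector[i_chunk], local_out)
        out[i] .= local_out # copy into the shared array only at the end
    end
end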

This is a working minimal example:

using ChunkSplitters
using Base.Threads

function simulation!(output, input, buffer)
    s = 0.0 # expensive stuff, just to take time
    for i in 2:10^3
        s += log(i)^(1/13)
    end
    buffer[1] = input
    output[1] = s * input * buffer[1]
end

function many_simulations2(n; nchunks = Threads.nthreads())
    inputs = 1:n
    outputs = [zeros(1) for _ in 1:n]
    buffers = [zeros(1) for _ in 1:nchunks] # one buffer per chunk
    @threads for (sim_range, i_chunk) in chunks(1:n, nchunks)
        for isim in sim_range
            simulation!(outputs[isim], inputs[isim], buffers[i_chunk])
        end
    end
    return outputs
end
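To check the result: from simulation!, each output should hold s * input^2, where s is the constant sum it computes, so

outputs = many_simulations2(10^4)
s = sum(log(i)^(1/13) for i in 2:10^3)
@assert all(outputs[i][1] ≈ s * i^2 for i in 1:10^4)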

This is a variation of the code above, using a Channel to store the buffers: take! hands a buffer to the task for exclusive use, and put! returns it to the channel when the chunk is done:

#
# Using channels to store buffers
#
function many_simulations(n; nchunks = Threads.nthreads())
    inputs = 1:n
    outputs = [zeros(1) for _ in 1:n]
    buffers = Channel{Vector{Float64}}(nchunks)
    for _ in 1:nchunks
        put!(buffers, zeros(1))
    end
    @sync for (sim_range, _) in chunks(1:n, nchunks)
        @spawn begin
            buffer = take!(buffers) # exclusive use of this buffer inside the task
            for isim in sim_range
                simulation!(outputs[isim], inputs[isim], buffer)
            end
            put!(buffers, buffer) # return the buffer for the next task
        end
    end
    return outputs
end

I also think that this (simpler) pattern solves any possible false sharing associated with the buffers: it allocates the buffer independently for each chunk of work, locally inside the spawned task.

#
# Allocate the buffer for each chunk: avoids false sharing
#
using ChunkSplitters, Base.Threads
function many_simulations3(n; nchunks = Threads.nthreads())
    inputs = 1:n
    outputs = [zeros(1) for _ in 1:n]
    @sync for (sim_range, _) in chunks(1:n, nchunks)
        @spawn begin
            local buffer = zeros(1) # note: this `local` is redundant
            for isim in sim_range
                simulation!(outputs[isim], inputs[isim], buffer)
            end
        end
    end
    return outputs
end

(I don't see any performance difference among these functions in this MWE.)
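For instance, one way to compare them, assuming BenchmarkTools.jl is installed:

using BenchmarkTools
@btime many_simulations(10^4);
@btime many_simulations2(10^4);
@btime many_simulations3(10^4);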
