Monte Carlo is one of the archetypal embarrassingly parallel problems. I have been wondering for some time what the fastest way to do Monte Carlo in Julia is, for problems where each replication is costly enough for parallelization to be of some benefit. Using threads is a very easy and transparent possibility: one just puts Threads.@threads in front of the main Monte Carlo loop, though there have been some reports of threads not scaling so well, for whatever reason. Another solution is MPI. With MPI.jl (https://github.com/JuliaParallel/MPI.jl), using MPI with Julia is pretty straightforward. There are other possibilities, of course, but I'm working with what is comfortable for me.
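To make the threads approach concrete, here is a minimal sketch of that pattern (not the actual code from the repo; onerep() is a hypothetical stand-in for whatever one replication computes):

using Base.Threads
function montecarlo(reps)
    results = Vector{Float64}(undef, reps)  # preallocated, so each thread writes to its own slots
    @threads for r = 1:reps
        results[r] = onerep(r)  # replications are independent, so the loop parallelizes trivially
    end
    return results
end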
I took a problem I'm currently working on and implemented the Monte Carlo loop both with threads and with MPI. The actual research problem at the core of each replication is MCMC applied to an ABC-style criterion that uses statistics passed through a trained neural net. The details of that are perhaps not relevant here. Running the code on a single thread takes about 20 seconds or so, so doing 1000 replications is a bit costly. How well this inner problem is optimized is not what I'm worried about here; I just want to see how one might speed up typical user code that may not be especially well optimized. Results could perhaps change with optimization, but at least both methods are using the same core code.
So, the code I have run is at https://github.com/mcreel/SNM/tree/master/examples/Auction. To run it, activate and instantiate the project in the SNM directory. Then, if needed, run BuildMPI.jl in the examples/Auction directory to use the system MPI libraries. The two files to run are AuctionMC_threads.jl and AuctionMC_mpi.jl. I do this from the Linux terminal prompt as follows, where X is the number of threads or MPI workers to use:
export JULIA_NUM_THREADS=X
time julia AuctionMC_threads.jl
or
mpirun -np X+1 julia AuctionMC_mpi.jl
The MPI code dedicates the first rank to gathering results rather than executing the main code, which is why the call uses X+1 MPI ranks.
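As a rough sketch of that organization (again with the hypothetical onerep() standing in for the real replication; the actual code in the repo differs, and the positional MPI.jl send/recv signatures shown here are those of the versions current around Julia 1.5, with newer versions moving to keyword arguments):

using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nworkers = MPI.Comm_size(comm) - 1  # rank 0 only gathers, so workers are ranks 1..nworkers
reps = 1000
if rank == 0
    # rank 0 collects one result per replication; rep r was done by worker mod1(r, nworkers)
    results = Vector{Float64}(undef, reps)
    for r = 1:reps
        results[r], _ = MPI.recv(mod1(r, nworkers), r, comm)  # (source, tag, comm)
    end
else
    # worker ranks split the replications among themselves, round-robin
    for r = rank:nworkers:reps
        MPI.send(onerep(r), 0, r, comm)  # (object, dest, tag, comm)
    end
end
MPI.Finalize()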
The results I get on a machine with 4 AMD Opteron 6380 CPUs (32 total cores, with hyperthreading), using Open MPI and Julia 1.5.2, are as follows:
Base time, single thread: 22m34s

workers   threads   mpi
20        9m5s      7m34s
10        8m13s     7m15s
5         9m26s     8m5s
2         13m21s    12m54s
My conclusions, which are in line with the priors I have built up from less formal comparisons, are that MPI makes somewhat better use of the cores, but the advantage is not great. Using MPI is also more work than using threads, so I suppose that I prefer threads for the greater convenience.