An embarrassingly parallel problem: threads or MPI?

Monte Carlo is one of the archetypal embarrassingly parallel problems. I have been wondering for some time what the fastest way to do Monte Carlo in Julia is, for problems where each replication is costly enough that parallelization is of some benefit. Using threads is a very easy and transparent possibility: one just needs to put Threads.@threads in front of the main Monte Carlo loop. However, there have been some reports of threads not scaling so well, for whatever reason. Another solution is MPI. With MPI.jl (https://github.com/JuliaParallel/MPI.jl), using MPI with Julia is pretty straightforward. There are other possibilities, of course, but I’m working with what is comfortable for me.
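
A minimal sketch of the threaded approach, assuming a hypothetical `replication(seed)` function that stands in for the costly inner problem (it is not the code from the SNM repository). Each replication writes to its own slot of a preallocated vector, so the threads never write to the same location:

```julia
using Random, Statistics

# Hypothetical stand-in for one costly Monte Carlo replication.
# Seeding by replication index keeps results reproducible for any thread count.
function replication(seed::Int)
    rng = MersenneTwister(seed)
    return mean(randn(rng, 100_000))
end

function montecarlo_threads(reps::Int)
    results = Vector{Float64}(undef, reps)
    # The only change from a serial loop is the Threads.@threads prefix.
    Threads.@threads for r in 1:reps
        results[r] = replication(r)
    end
    return results
end

results = montecarlo_threads(100)
println("threads: ", Threads.nthreads(), ", mean of estimates: ", mean(results))
```

The number of threads is fixed when Julia starts, for example through the JULIA_NUM_THREADS environment variable.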

I took a problem I’m currently working on and implemented Monte Carlo both with threads and with MPI. The actual research problem that is the core of each replication is MCMC applied to an ABC style criterion that uses statistics passed through a trained neural net. The details of that are perhaps not relevant here. A single replication takes about 20 seconds on a single thread, so it’s a bit costly to do 1000 replications. How well this inner problem is optimized is not what I’m worried about here; I just want to see how one might speed up some typical user code that may not be optimized very well. Of course, results could change with optimization, but at least both methods are using the same core code here.

So, the code I have run is at https://github.com/mcreel/SNM/tree/master/examples/Auction. To run it, activate and instantiate the project in the SNM directory. Then, if needed, run BuildMPI.jl in the examples/Auction directory to use the system MPI libraries. The two files to run are AuctionMC_threads.jl and AuctionMC_mpi.jl. I do this from the Linux terminal prompt as follows, where X is the number of threads or MPI workers to use:

export JULIA_NUM_THREADS=X
time julia AuctionMC_threads.jl

or

mpirun -np X+1 julia AuctionMC_mpi.jl

The MPI code uses the first rank only to gather results, not to execute the main code, which is why the call uses X+1 MPI ranks.

The results I get on a machine with 4 AMD Opteron 6380 CPUs (32 total cores, with hyperthreading) using Open MPI, and Julia 1.5.2, are as follows:

Base time, single thread: 22m34s

X     threads   mpi
20    9m5s      7m34s
10    8m13s     7m15s
5     9m26s     8m5s
2     13m21s    12m54s
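
To make the comparison easier to read, the timings above can be converted to speedups relative to the single-thread base time. The sketch below is only arithmetic on the reported numbers. It assumes the single-thread run is the right reference for both methods, and it defines efficiency as speedup divided by X, ignoring the extra gathering rank in the MPI runs. At X = 10 this gives a speedup of roughly 2.7 for threads and 3.1 for MPI.

```julia
# Convert minutes and seconds to seconds.
secs(m, s) = 60m + s

base = secs(22, 34)  # single-thread base time reported above

# Reported wall-clock times, keyed by X (number of threads or MPI workers).
threads = Dict(20 => secs(9, 5), 10 => secs(8, 13), 5 => secs(9, 26), 2 => secs(13, 21))
mpi     = Dict(20 => secs(7, 34), 10 => secs(7, 15), 5 => secs(8, 5), 2 => secs(12, 54))

speedup(t) = base / t                  # how many times faster than the base run
efficiency(t, x) = speedup(t) / x      # 1.0 would be perfect scaling

for x in sort(collect(keys(threads)))
    println("X = ", x,
            ": threads speedup ", round(speedup(threads[x]), digits=2),
            " (efficiency ", round(efficiency(threads[x], x), digits=2), ")",
            ", MPI speedup ", round(speedup(mpi[x]), digits=2),
            " (efficiency ", round(efficiency(mpi[x], x), digits=2), ")")
end
```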

My conclusions, which are in line with the priors I have built up from less formal comparisons, are that MPI is a little better at making use of the cores, but the advantage is not great. MPI is also harder to use than threads. I suppose that I prefer threads for the greater convenience.

I’d personally be pretty biased towards threads for this, mostly because of the convenience and the fact that Julia threads will cooperate with other multi-threaded Julia package code you might happen to run, instead of interfering as I suspect MPI would.

I guess an important question might be if you want to scale up with multiple machines.

I think it’s also worth mentioning that Julia’s threading is still only a year or so old. I suspect that it will be a major focus of 1.7.

Out of curiosity, can you add @distributed or pmap? It’s also not as convenient as @threads, but I’ve often found it to scale better.
Like MPI, Distributed can scale to multiple machines (if it comes to that).
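
For reference, a minimal sketch of what the pmap version could look like, again with a hypothetical `replication(seed)` stand-in rather than the actual Auction code. It assumes Julia was started with worker processes, for example `julia -p X`. With no workers it still runs, just on the main process.

```julia
using Distributed

# Load the needed standard libraries on every process.
@everywhere using Random, Statistics

# Hypothetical stand-in for one costly replication, defined on all processes.
@everywhere function replication(seed::Int)
    rng = MersenneTwister(seed)
    return mean(randn(rng, 100_000))
end

# pmap hands out replications to workers one at a time, which also gives
# simple load balancing when replications differ in cost.
results = pmap(replication, 1:100)
println("workers: ", nworkers(), ", mean of estimates: ", mean(results))
```

Unlike the threaded loop, each worker is a separate process, so anything the replication needs has to be made available with `@everywhere`.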

Yeah, multi-process solutions like MPI and Distributed might be better if you need to create many objects (at least with current julia).

Shameless plug: BTW, I’m interested in abstracting out executing mechanisms such as Threads vs Distributed. Transducers.jl supports both of these execution mechanisms and you can use FLoops.jl to choose an appropriate “executor” of a for loop. So, in principle, you can just write a single for loop and then choose between threading and processing afterwards.

I’ll be happy to run this same code again with future versions of Julia. I already use threads for the most part on this machine, having switched from MPI.

I haven’t ever invested the time to learn how to use these methods, and I’m afraid that now’s not the time for me. It would be nice to see a broader comparison than what I’ve done.

This is one of the reasons I’ve always liked MPI: write and test code on a laptop, and then run on a cluster. I am now using threads for the most part, though, because I no longer use a cluster.

If you include compilation time in your measurements, 20 seconds total running time will be way too little to make any conclusions.

Do the single jobs always run for 20 seconds or is there need for load balancing?

The times are for 100 Monte Carlo replications. The running time for a single replication on a single thread, not counting compilation time, is roughly 20 seconds. The total running time on a single thread is about 22 minutes. So, even though compilation is included here, it’s not a major factor, and the differences between MPI and threads are apparent.
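
One way to check how much compilation contributes is to time the first call separately from a later one. This is only a sketch, with a hypothetical `replication(seed)` stand-in for the real inner problem:

```julia
using Random, Statistics

# Hypothetical stand-in for one costly replication.
function replication(seed::Int)
    rng = MersenneTwister(seed)
    return mean(randn(rng, 100_000))
end

first_call  = @elapsed replication(1)  # includes JIT compilation of replication
second_call = @elapsed replication(2)  # runs already-compiled code

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

If the gap between the two is small compared with the total run time, as with a 22 minute run, including compilation in the measurement should not change the conclusions.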
