An embarrassingly parallel problem: threads or MPI?

mcreel · October 9, 2020, 4:28pm

Monte Carlo is one of the archetypal embarrassingly parallel problems. I have been wondering for some time what’s the fastest way to do Monte Carlo in Julia, for problems where each replication has enough cost to make parallelization of some benefit. Using threads is a very easy and transparent possibility, one just needs to put Threads.@threads in front of the main Monte Carlo loop, but there have been some reports of threads not scaling so well, for whatever reason. Another solution is MPI. With MPI.jl (GitHub - JuliaParallel/MPI.jl: MPI wrappers for Julia), using MPI with Julia is pretty straightforward. There are other possibilities, of course, but I’m working with what is comfortable for me.

I took a problem I’m currently working on and implemented Monte Carlo both with threads and with MPI. The actual research problem that is the core of each replication is MCMC applied to an ABC style criterion that uses statistics passed though a trained neural net. The details of that are perhaps not relevant here. Running the code on a single thread takes about 20 seconds or so, so it’s a bit costly to do 1000 replications. How well this inner problem is optimized is not the topic I’m worried about here, I just want to see how one might speed up some typical user code that may not be optimized very well. Of course, results could perhaps change with optimization, but at least both methods are using the same core code here.

So, the code I have run is at https://github.com/mcreel/SNM/tree/master/examples/Auction. To run it, activate and instantiate the project in the SNM directory. Then, if needed, run BuildMPI.jl in the examples/Auction directory to use the system MPI libraries. The two files to run are AuctionMC_threads.jl and AuctionMC_mpi.jl. I do this from the Linux terminal prompt as follows, where X is the number of threads or MPI workers to use:

export JULIA_NUM_THREADS=X
time julia AuctionMC_threads.jl

or

mpirun -np X+1 julia AuctionMC_mpi.jl

The MPI code uses the first rank to gather results, but not execute the main code, which is why the call uses X+1 mpi ranks.

The results I get on a machine with 4 AMD Opteron 6380 CPUs (32 total cores, with hyperthreading) using Open MPI, and Julia 1.5.2, are as follows:

Base time, single thread 
22m34s

threads
20: 9m5s
10: 8m13s
5: 9m26s
2: 13m21s

mpi
20: 7m34s
10: 7m15s
5: 8m5s
2: 12m54s

My conclusions, which are in line with my priors that I have built up from less formal comparisons, are that MPI is a little better at making use of the cores, but the advantage is not great. Using MPI is harder to do than threads. I suppose that I prefer threads for the greater convenience.

Mason · October 9, 2020, 4:36pm

I’d personally be pretty biased towards threads for this mostly because of the convenience and the fact that julia threads they’ll cooperate with other multi-threaded julia package code you might happen to run instead of interfering like I suspect MPI would do.

tkf · October 9, 2020, 4:52pm

I guess an important question might be if you want to scale up with multiple machines.

Oscar_Smith · October 9, 2020, 5:13pm

I think it’s also with mentioning that Julia’s threading is still only a year or so old. I suspect that it will be a major focus of 1.7

Elrod · October 9, 2020, 5:38pm

Out of curiosity, can you add @distributed or pmap? It’s also not as convenient as @threads, but I’ve often found it to scale better.
Like MPI, Distributed can scale to multiple machines (if it comes to that).

tkf · October 9, 2020, 9:21pm

Yeah, multi-process solutions like MPI and Distributed might be better if you need to create many objects (at least with current julia).

tkf · October 10, 2020, 1:01am

Shameless plug: BTW, I’m interested in abstracting out executing mechanisms such as Threads vs Distributed. Transducers.jl supports both of these execution mechanisms and you can use FLoops.jl to choose an appropriate “executor” of a for loop. So, in principle, you can just write a single for loop and then choose between threading and processing afterwards.

mcreel · October 10, 2020, 5:44am

I’ll be happy to run this same code again with future versions of Julia. I already use threads for the most part on this machine, having switched from MPI.

mcreel · October 10, 2020, 5:48am

I haven’t ever invested the time to learn how to use these methods, and I’m afraid that now’s not the time for me. It would be nice to see a broader comparison than what I’ve done.

mcreel · October 10, 2020, 5:50am

This is one of the reasons I’ve always liked MPI: write and test code on a laptop, and then run on a cluster. I am now using threads for the most part, though, because I no longer use a cluster.

mikkoku · October 10, 2020, 6:07am

If you include compilation time in your measurements, 20 seconds total running time will be way too little to make any conclusions.

Do the single jobs always run for 20 seconds or is there need for load balancing?

mcreel · October 10, 2020, 6:22am

The times are for 100 Monte Carlo replications. The running time for a single replication on a single thread, not counting compilation time, is roughly 20 seconds. The total running time on a single thread is about 22 minutes. So, even though compilation is included here, it’s not a major factor, and the differences between MPI and threads are apparent.

mcreel · June 10, 2021, 4:21pm

In case anyone stumbles on this, I was not aware that randn is not thread safe. Thread performance could probably be improved using information here: Question about Multi-threading Performance

lmiq · June 10, 2021, 4:52pm

I think it is now: Random number and parallel execution - #16 by greg_plowman

mcreel · June 10, 2021, 6:10pm

Ah, thanks, that’s good to know.

Topic		Replies	Views
Is multi-threading in Julia more like openMP or MPI? General Usage	2	3014	August 6, 2021
Multithreading and pmap Julia at Scale	8	2737	January 5, 2019
What type of code is `threads` good for? General Usage	12	1687	July 26, 2019
Embarrassingly parallel multi-threading doesn't scale Performance multithreading	17	1614	October 16, 2021
Multi-threading on a 2 CPU system New to Julia multithreading	15	1084	February 2, 2023

An embarrassingly parallel problem: threads or MPI?

Related topics