I have a numerical calculation (market clearing in a heterogeneous agent overlapping generations model) which is “embarrassingly parallel” in the sense that I can split it up to 10–30 parts that can run separately (the cohorts).
Using a single thread, all parts run for about 0.2s (total). Using 8 threads on my laptop (ThreadTools.jl and ThreadsX.jl, same result), this increases to 5s. This is on Julia 1.5.
So far I have not been able to make an MWE, the code is somewhat large and small examples fail to replicate this. The computation is memory intensive with nonlocal memory access patterns (this is somewhat inevitable, despite best efforts to mitigate it).
Any advice would be appreciated on
how to benchmark and debug this,
whether worse performance using threads is something to be expected in these scenarios,
whether going to a machine with more CPU cache and memory bandwidth could change these results.
I realize that without an MWE this is a vague problem, so I am grateful for vague shots in the dark too on how to deal with this (“just don’t use threads” is an option, so I don’t waste more time on this).
Is it feasible to try @distributed/pmap (preferably without using SharedArrays)?
If you see good performance with them, that would let you rule out CPU cache/memory bandwidth, and point towards something like false sharing as the culprit (trying to eliminate those as causes is why I suggested avoiding SharedArray).
Another thought: are you using BLAS at all? In my experience, Julia multi-threading does not play nice with BLAS multithreading, although I have never experienced it working that much worse than a single-thread performance (ref). I’m always sure to use BLAS.set_num_threads(1) when I want thread-parallel and linear algebra (thanks again, @Elrod).
I have found that threads can help to reduce time, but only up to a point. I have a machine with 4 oldish AMD quad core CPUs, so 16 physical cores. I also run embarrassingly parallel jobs, and I often find that using only 8 or so threads is the sweet spot. I think it depends strongly on cache contention. Too many cores make the usage of L2 cache inefficient, and time can go above the single thread time. This was the explanation a computer scientist gave me. I have used MPI in the past for similar problems, and my impression was that the sweet spot used more cores, but this only a vague idea, unsupported by careful benchmarking.
Many old AMD CPUs (i.e., the bulldozer line) had core pairs share floating point units, so those 16 physical cores may have only had 8 distinct floating point units.
a Bulldozer CMT module is equal to a dual-core processor in its integer calculation capabilities, and to either a single-core processor or a handicapped dual-core in terms of floating-point computational power, depending on whether the code is saturated in floating point instructions in both threads running on the same CMT module, and whether the FPU is performing 128-bit or 256-bit floating point operations. The reason for this is that for each two integer cores, that is, within the same module, there is a single floating-point unit consisting of a pair of 128-bit FMACexecution units.
Hmm, these are AMD Barcelona Opterons, which, according to Wikipedia, precede the Bulldozer line. So, I imagine that they may have this “feature”. That may explain why my informal experimentation more or less settled on using 8 threads. Thanks!
That is my experience as well. Ideally no allocations in the threaded code. I am not sure to what extend the GC is multi-threaded nowadays. When I implemented the first prototype there was a big lock around all GC calls.
I have been comparing a fully multi-threaded C++ library (OpenMP) with my own Julia library and after quite some optimization I am now getting similar speedups when increasing the number of threads. Its not that predictable as in C++ (i.e. when I run the code several times, I get more deviations (bad outliers) for the Julia code than for the C++ code. Further it seems that Julia has more issues with large number of threadsand I thus restricted my analysis to less than 12 threads in the end.
Do you use the nonlocal access pattern to read? Or to write? If it’s read-only, I wouldn’t worry too much. If it were write, is it possible to allocate the destination array/object per thread, use it in each thread, and then merge them afterwards?
We had similar problems with DynamicGrids.jl trying to run it on 48 and 72 core servers. The results also weren’t so good.
First we cut down allocations to basically nothing, which improved things. But the problem ended up looking like it was scaling of the shared resources like L3 cache, and maybe math processors too. Arrays getting bumped in and out of cache and RAM seems to slow things down quite a bit - which allocations will do, but having more data than will fit in cache will also do.
It was hard to find a single server that could beat my desktop, as the scaling meant 6 4Ghz CPUs could pretty much match 48 or even the 72 2Ghz server CPUs, which were barely any better than the 48.
Multiple distributed machines could be the way to get around this, but we gave up at that point and just used my desktop for simplicity.
For embarrassingly parallel problems which may allocate, do I/O, etc., is there reason to expect that MPI would outperform threads, or should the methods perform about the same? A while ago, I published a paper giving examples of MPI with Julia for embarrassingly simple problems like Monte Carlo. (wp version is at https://github.com/mcreel/JuliaMPIMonteCarlo.jl). There, I found that, for code which has a bit of computational demand (similar to particle filtering), I could get a slightly better than 9X speedup running using 26 MPI ranks. The machine is the same 16 core machine I described above (32 cores with hyperthreading). I don’t recall why I used 26 ranks, possibly that was the number that gave best performance. Is there reason to expect that this sort of problem, using threads, would also get a similar speedup?
I may update the code for Julia 1.x, and try with threads, but probably not soon.
Yeah, if you need to allocate a lot of things, I think Distributed is better. The binary trees benchmark game (which allocates a lot of nodes) is using Distributed instead Threads for the same reason: Help with binary trees benchmark games example
Thanks again for all the help and suggestions. We managed to eliminate most of the allocations, and this does help the threaded code a lot (using ThreadsX.mapreduce). Below I include benchmark output on a Ryzen 3600 (6 cores):
$ julia -t 1 --project=@. script/mass_benchmark.jl
132.305 ms (747 allocations: 429.94 KiB)
$ julia -t 2 --project=@. script/mass_benchmark.jl
65.260 ms (1478 allocations: 469.08 KiB)
$ julia -t 4 --project=@. script/mass_benchmark.jl
33.919 ms (2202 allocations: 508.06 KiB)
$ julia -t 6 --project=@. script/mass_benchmark.jl
22.949 ms (2205 allocations: 508.16 KiB)
$ julia -t 8 --project=@. script/mass_benchmark.jl
22.498 ms (2207 allocations: 508.22 KiB)
I suspect that the cache size is really important, since on my laptop (i5-8250U, 4 cores) performance degrades even for 2 threads.