I’m trying to think more seriously about writing Julia code that is either thread- or worker-parallel, to make better use of machines with many threads or individual cluster nodes with many CPUs. I recognize that in general it is hard to give advice about the best way to write parallel code, and that the first and primary answer is always to benchmark the options in each individual application. But I’m hoping to get some kind of gestalt for what parallelizes well, since my sensibilities about this are clearly weak at the moment.
I’ve written some test code here for a problem that I would think parallelizes quite well: pass over a collection of matrices and factorize each of them. This little script tests the thread-parallel (`JULIA_NUM_THREADS=XYZ julia ...`) and the worker-parallel (`julia -p XYZ ...`) timings. I also threw in the timings from Transducers.jl, out of curiosity and because I really like that package.
```julia
using Distributed, Transducers, BenchmarkTools
@everywhere using LinearAlgebra

@everywhere term_apply(X) = factorize(X'X)

@everywhere function tmap(fun, VX)
    funX1 = fun(VX[1])
    out = Vector{typeof(funX1)}(undef, length(VX))
    out[1] = funX1
    Threads.@threads for j in 2:length(out)
        @inbounds out[j] = fun(VX[j])
    end
    return out
end

nmat = 500
matrices = [randn(512, 512) for _ in 1:nmat]

println("Serial time:")
@btime map(term_apply, $matrices)

if Threads.nthreads() > 1 && nworkers() == 1
    println("Thread-parallel time:")
    @btime tmap(term_apply, $matrices)
    println("Transducer thread-parallel time:")
    @btime tcollect(Map(term_apply), $matrices)
elseif nworkers() > 1 && Threads.nthreads() == 1
    println("Pmap parallel time:") # much worse with bigger batch size.
    @btime pmap(term_apply, $matrices)
    println("Transducer worker-parallel time:")
    @btime dcollect(Map(term_apply), $matrices)
else
    error("Please use multiple threads OR workers.")
end
```
I set the environment variables `OMP_NUM_THREADS` and `OPENBLAS_NUM_THREADS` to 1, and my exact Julia installation is
```
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  uname: Linux 5.3.15-300.fc31.x86_64 #1 SMP Thu Dec 5 15:04:01 UTC 2019 x86_64 x86_64
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz:
          speed         user         nice          sys         idle          irq
     #1  2777 MHz      18029 s          8 s       2263 s      84641 s        309 s
     #2  2772 MHz      17801 s          4 s       2321 s      84719 s        424 s
     #3  2782 MHz      17855 s          3 s       2235 s      85009 s        281 s
     #4  2770 MHz      17865 s         12 s       2427 s      84551 s        414 s
  Memory: 7.6603240966796875 GB (5744.65234375 MB free)
  Uptime: 1057.0 sec
  Load Avg:  1.30517578125  1.1337890625  0.6298828125
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
```
and I ran the above script with

```
JULIA_NUM_THREADS=4 /usr/bin/julia -O3 test.jl
```

and

```
JULIA_NUM_THREADS=1 /usr/bin/julia -O3 -p 4 test.jl
```
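For what it’s worth, I believe the BLAS thread pinning can also be done from inside the script rather than through environment variables; here is a minimal sketch, assuming the standard LinearAlgebra.BLAS interface (this reflects my understanding, not something I have verified carefully):

```julia
using Distributed, LinearAlgebra

# Pin BLAS to a single thread on the master process and on every worker, so
# each factorize call doesn't also spin up its own nest of BLAS threads.
@everywhere using LinearAlgebra
@everywhere LinearAlgebra.BLAS.set_num_threads(1)
```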
The results were interesting. To summarize:
- The thread-parallel map-like function did improve as the thread count increased, but not by a huge amount. Going from one thread to four brought 870 ms down to 530 ms, and the Transducers `tcollect` clocked in at 520 ms.
- The worker-parallel code really did not scale particularly well. Going from three workers to four made both `pmap` and `dcollect` slower. `pmap` with three workers clocked in at about 660 ms, so there was some improvement. For `dcollect` the timing was, if anything, very slightly worse than the serial version (unless somebody points out that I’m doing something wrong, maybe it makes sense for me to file an issue about that).
So my questions:
- Is this the kind of speedup that can reasonably be expected? I didn’t expect scaling exactly linear in the number of threads or workers, but on the other hand, going from one thread to four and not even halving the time is slightly discouraging for a task that seems perfectly parallelizable.
- Playing with the `batch_size` argument of `pmap` was a bit of a disaster (see the first sketch after this list for roughly what I mean). When I set it to `div(length(matrices), nworkers())`, my computer became very unhappy and unresponsive. Even after killing the process, it was so sluggish that I had to reboot. Judging from `htop` before things got very sluggish, this is because Julia asked for almost all of the computer’s memory. Is there a sensible default choice for this argument?
- Is there a better mindset for writing parallel code than “make collections of things and pass over them”? I would have thought that something like this is really the best candidate for parallelization. But maybe I’m incorrect: maybe passing arrays around between threads/workers is an issue, and the better mindset is to distribute the lists ahead of time, do all the work, and only collect the results onto one worker/thread after the heavy lifting is done (the second sketch after this list shows the kind of pattern I mean).
- Does anybody have examples of code where parallelizing in one of the two ways above worked in the way one might naively expect? Like, where adding a second worker/thread halved the computation time?
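To make the `batch_size` question concrete, here is roughly the kind of call I was experimenting with, but with an arbitrary small fixed batch size instead of the `div(length(matrices), nworkers())` choice that blew up my memory (a sketch only; the value 8 is not a recommendation):

```julia
# batch_size controls how many items pmap ships to a worker at a time. A small,
# fixed value (8 here, chosen arbitrarily for illustration) avoids sending each
# worker one enormous chunk of matrices in a single message.
factorized = pmap(term_apply, matrices; batch_size=8)
```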
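And to make the “distribute first, collect afterwards” idea in the third question concrete, this is the kind of pattern I have in mind; a rough sketch only (the chunking and the `distributed_map` name are just illustrative, and I have not benchmarked this):

```julia
using Distributed
@everywhere using LinearAlgebra
@everywhere term_apply(X) = factorize(X'X)

function distributed_map(fun, VX)
    # One contiguous chunk per worker; collect copies each chunk so that only
    # the chunk (not the whole input vector) is serialized to the remote process.
    chunks = [collect(c) for c in Iterators.partition(VX, cld(length(VX), nworkers()))]
    # Do all the work remotely, and only move the results back at the very end.
    futures = [(@spawnat w map(fun, c)) for (w, c) in zip(workers(), chunks)]
    return reduce(vcat, fetch.(futures))
end

# Usage: factorized = distributed_map(term_apply, matrices)
```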
I know that v1.3 includes some very interesting new functionality, but it isn’t in Fedora’s official repos, and judging from the Copr repo I gather that there are some issues building it for Red Hat-based Linux. So for the moment I’m not making the upgrade.
I’m very interested to hear people’s thoughts. Thank you in advance for your reading and consideration.