# Writing effective parallel code

I’m trying to think more seriously about writing Julia code that is either thread- or worker-parallel, to make better use of high-thread-count machines or individual cluster nodes with many CPUs. I recognize that in general it is hard to give advice about the best way to write parallel code, and that the first and primary answer is always to benchmark the options in each individual application. But I’m hoping to get some kind of gestalt about what parallelizes well, and my sensibilities about this are clearly weak at the moment.

I’ve written some test code for a problem that I would think parallelizes quite well: pass over a collection of matrices and factorize each of them. The little script below tests the thread-parallel (`JULIA_NUM_THREADS=XYZ julia ...`) and worker-parallel (`julia -p XYZ ...`) timings. I also threw in timings from Transducers.jl, out of curiosity and because I really like that package.

``````
using Distributed, Transducers, BenchmarkTools
@everywhere using LinearAlgebra

@everywhere term_apply(X) = factorize(X'X)

@everywhere function tmap(fun, VX)
    funX1  = fun(VX[1])
    out    = Vector{typeof(funX1)}(undef, length(VX))
    out[1] = funX1
    Threads.@threads for j in 2:length(VX)
        @inbounds out[j] = fun(VX[j])
    end
    return out
end

nmat     = 500
matrices = [randn(512, 512) for _ in 1:nmat]

println("Serial time:")
@btime map(term_apply, $matrices)

println("Thread-parallel time:")
@btime tmap(term_apply, $matrices)

println("Transducer thread-parallel time:")
@btime tcollect(Map(term_apply), $matrices)

println("Pmap parallel time:") # much worse with bigger batch size.
@btime pmap(term_apply, $matrices)

println("Transducer worker-parallel time:")
@btime dcollect(Map(term_apply), $matrices)

``````

I set the environment variables `OMP_NUM_THREADS` and `OPENBLAS_NUM_THREADS` to 1, and my exact Julia installation is
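For what it’s worth, the same pinning can also be done from inside Julia instead of via the environment; a minimal sketch:

``````julia
using LinearAlgebra

# Same effect as exporting OPENBLAS_NUM_THREADS=1, but done from inside
# Julia: keep OpenBLAS single-threaded so its own threading doesn't
# compete with Julia's threads/workers for cores.
BLAS.set_num_threads(1)
``````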

``````
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
uname: Linux 5.3.15-300.fc31.x86_64 #1 SMP Thu Dec 5 15:04:01 UTC 2019 x86_64 x86_64
CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz:
speed         user         nice          sys         idle          irq
#1  2777 MHz      18029 s          8 s       2263 s      84641 s        309 s
#2  2772 MHz      17801 s          4 s       2321 s      84719 s        424 s
#3  2782 MHz      17855 s          3 s       2235 s      85009 s        281 s
#4  2770 MHz      17865 s         12 s       2427 s      84551 s        414 s

Memory: 7.6603240966796875 GB (5744.65234375 MB free)
Uptime: 1057.0 sec
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
``````

and I ran the above script with

``````
JULIA_NUM_THREADS=4 /usr/bin/julia -O3 test.jl
``````

and

``````
JULIA_NUM_THREADS=1 /usr/bin/julia -O3 -p 4 test.jl
``````

The results were interesting. To summarize:

• The thread-parallel map-like function did improve as the thread count increased, but not by a huge amount: going from one thread to four took the time from 870 ms down to 530 ms, and the Transducers `tcollect` clocked in at 520 ms.
• The worker-parallel code really did not scale well. Going from three workers to four made both `pmap` and `dcollect` slower. `pmap` with three workers clocked in at about 660 ms, so there was some improvement. The `dcollect` timing was, if anything, very slightly worse than the serial version (unless somebody points out that I’m doing something wrong, maybe it makes sense for me to file an issue about that).

So my questions:

• Is that speedup what can reasonably be expected? I didn’t expect scaling exactly linear in threads or workers, but on the other hand going from one thread to four not even halving the time is slightly discouraging for a task that seems perfectly parallelizable.
• Playing with the `batch_size` argument in `pmap` was a bit of a disaster. When I set it to `div(length(matrices), nworkers())`, my computer became very unhappy and unresponsive. Even after killing the process, it was so sluggish that I had to reboot. Judging from `htop` before it got very sluggish, this is because Julia asked for almost all the computer’s memory. Is there a sensible default choice for this variable?
• Is there a mindset that is better for writing parallel code than “make collections of things and pass over them”? I would think that something like this is really the best candidate for parallelization. But maybe I’m incorrect: maybe passing arrays around between threads/workers is an issue, and the better mindset is to distribute the lists ahead of time, do all the work, and only collect the results to one worker/thread after the heavy lifting is done.
• Does anybody have examples of code where parallelizing in one of the two ways above worked in the way one might naively expect? Like, where adding a second worker/thread halved the computation time?
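As a back-of-the-envelope check on the `batch_size` blow-up above, here is the memory arithmetic for the script’s 500 matrices of 512×512 `Float64` (a sketch; I’m assuming the main cost is each batch being serialized to a worker in one piece, plus copies along the way):

``````julia
# Rough memory arithmetic for the batch_size experiment.
nmat, n = 500, 512
bytes_per_matrix = n * n * 8        # Float64 entries, 8 bytes each

# batch_size = div(length(matrices), nworkers()) with 4 workers:
batch = div(nmat, 4)

# Each batch that gets serialized to a worker then holds:
batch_bytes = batch * bytes_per_matrix
println(batch_bytes / 2^20, " MiB per batch")   # prints "250.0 MiB per batch"
``````

So each worker gets handed a quarter-gigabyte blob before any serialization copies, which is consistent with the memory pressure I saw in `htop`.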

I know v1.3 includes some very interesting new functionality. But it isn’t in Fedora’s official repos, and judging from the Copr repo I gather that there are some issues building it for Red Hat-family systems. So for the moment I’m not making the upgrade.


I can see how it’s not always practical to build from source and that you’d rather stay with what’s already packaged. But in this specific instance, whatever you’re going to learn from your investigations and this thread is outdated from the start, no? And 1.3 probably implements exactly the things that are going to come out as ‘best practices’ or ‘desired features’ of your investigation.

I’ve personally had good experiences with building Julia locally – are you sure you don’t want to give that a try if parallel Julia is important to you?

I didn’t have the impression that the parallel implementation of something like passing over a list would necessarily change with 1.3. Is that inaccurate? Looking at the docs, `Threads.@threads` is still there. Are the internals of that meaningfully different than in 1.2? Maybe I have misunderstood the scope of the thread-parallel development in 1.3.

I don’t think this particular setup is desirable for Distributed.jl. You are sending a bunch of arrays to workers so there is a significant serialization cost. When using Distributed.jl-based code, it’s better to try to organize code to minimize the serialization cost. For example, if you want to process data in many files, don’t load it in the main process but rather map over the list of files and load each file in the workers. Also, ideally, try to minimize the data you send back to the main process.
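To make that suggestion concrete, a sketch of the restructuring (the `load_matrix` helper and the file list are hypothetical placeholders):

``````julia
using Distributed
@everywhere using LinearAlgebra

# Each worker loads its own input and ships back only a small result,
# so the big matrices never cross the wire.
@everywhere function process_file(path)
    X = load_matrix(path)           # hypothetical loader; runs on the worker
    F = factorize(X'X)              # the heavy lifting stays on the worker
    return F \ ones(size(X, 2))     # return one vector, not the factorization
end

# results = pmap(process_file, filenames)   # filenames: one path per input
``````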

IIRC you couldn’t use `@threads` inside `@threads` before 1.3, so functions using threads were not composable; 1.3 solves that. Also, the BLAS issue will be fixed at some point. In both cases you’d really want the latest Julia. I think “always use the latest release” is a good approach when it comes to multi-threading in Julia (or maybe for anything, actually).
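For what it’s worth, a 1.3-style composable version of the `tmap` above would look roughly like this (a sketch; `Threads.@spawn` requires ≥ 1.3):

``````julia
using LinearAlgebra

# Each call runs as a task on the shared thread pool; on 1.3+ the PARTR
# scheduler composes these with any threading inside `fun` instead of
# oversubscribing the cores the way nested @threads could before.
function tmap_spawn(fun, VX)
    tasks = [Threads.@spawn fun(x) for x in VX]
    return map(fetch, tasks)
end

tmap_spawn(X -> factorize(X'X), [randn(32, 32) for _ in 1:8])
``````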

PS: thanks for benchmarking Transducers


Interesting, thank you for the issue references. I have set `export OPENBLAS_NUM_THREADS=1` in my bashrc, which I think Julia does see. So that shouldn’t be the issue. But from what you’re suggesting it sounds like I’ve underestimated the communication cost for multiple workers.

Is there any rule of thumb with regard to the number of threads/workers? For example, my computer has a dual-core CPU and each core seems to have two hardware threads. So I might naively think that `JULIA_NUM_THREADS=2 julia -p 2` would be a reasonable default if I wanted to give Julia 100% of my compute resources. But that isn’t as efficient as `JULIA_NUM_THREADS=4` with only one worker.

Also, with regard to Transducers: thank you for writing it! Does it make sense to file a bug/issue or something for `dcollect` being significantly slower than `pmap`? Or is that expected behavior?

The threading scheduler changed in 1.3; check out the news, it is called PARTR. That is a big change, which will eventually allow threading to be robust with respect to the various levels of parallelism (user versus library). So I would definitely upgrade if I were you.


I probably should have said “threading support” rather than “multi-threading.” IIUC, OpenBLAS has memory management guarded by a lock, which can become a problem if you are using it from multiple Julia threads: https://github.com/JuliaLang/julia/issues/34102#issuecomment-565864782. But I agree that’s a useful workaround if you are using it with Distributed.jl.

Your CPU probably has hyper-threading. If your program is compute-bound, it’s better to set `JULIA_NUM_THREADS` to the number of physical cores. Check it with tools like `lscpu`.
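From inside Julia, `Sys.CPU_THREADS` gives the logical count; halving it is only a heuristic guess at the physical count (the assumption here is the usual two hyper-threads per core on Intel; `lscpu` or Hwloc.jl give the real answer):

``````julia
# Sys.CPU_THREADS counts *logical* CPUs (hyper-threads included).
logical = Sys.CPU_THREADS

# Heuristic: assume two hardware threads per physical core, as on most
# hyper-threaded Intel laptop/desktop CPUs. Not reliable everywhere.
physical_guess = max(1, logical ÷ 2)

println("logical CPUs: $logical, guessed physical cores: $physical_guess")
``````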

I don’t think mixing process-based and thread-based parallelism is a good approach unless you use Distributed.jl with multiple machines.

Yes, it may be possible to improve `dcollect` so bug reports (ideally with MWE) are appreciated. But I’d say it’s expected given a long history of Distributed.jl.


That is very helpful—thank you @tkf! And @PetrKryslUCSD, I will definitely look into that. Looks like it’s time to build v1.3 myself and do some exploring.

Thanks to everybody for commenting. I think I have an at least slightly better gestalt about parallel coding now.