Writing effective parallel code

I’m trying to think more seriously about writing Julia code that is either thread or worker parallel to make better use of high thread machines or individual cluster nodes with many CPUs. While I recognize that in general it is hard to give advice about the best way to write parallel code, and that the first and primary answer is always to benchmark the options in individual applications. But I’m hoping to get some kind of gestalt about what parallelizes well, and clearly my sensibilities about this are currently weak.

I’ve written some test code here of a problem that I would think parallelizes quite well: pass over a collection of matrices and factorize each of them. This little script tests the thread parallel (JULIA_NUM_THREADS=XYZ julia...) and the worker-parallel (julia -p XYZ ...) timings. I also threw in the timings from Transducers.jl out of curiosity and because I really like that package.


using Distributed, Transducers, BenchmarkTools
@everywhere using LinearAlgebra

@everywhere term_apply(X) = factorize(X'X)

@everywhere function tmap(fun, VX)
  funX1  = fun(VX[1])
  out    = Vector{typeof(funX1)}(undef, length(VX))
  out[1] = funX1
  Threads.@threads for j in 2:length(out)
    @inbounds out[j] = fun(VX[j])
  end
  return out
end

nmat     = 500
matrices = [randn(512, 512) for _ in 1:nmat]

println("Serial time:")
@btime map(term_apply, $matrices)

if Threads.nthreads() > 1 && nworkers() == 1

  println("Thread-parallel time:")
  @btime tmap(term_apply, $matrices)

  println("Transducer thread-parallel time:")
  @btime tcollect(Map(term_apply), $matrices)

elseif nworkers() > 1 && Threads.nthreads() == 1

  println("Pmap parallel time:") # much worse with bigger batch size.
  @btime pmap(term_apply, $matrices)

  println("Transducer worker-parallel time:")
  @btime dcollect(Map(term_apply), $matrices)

else
  throw(error("Please use multiple threads OR workers."))
end

I set the environment variables OMP_NUM_THREADS and OPENBLAS_NUM_THREADS to 1, and my exact Julia installation is

Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  uname: Linux 5.3.15-300.fc31.x86_64 #1 SMP Thu Dec 5 15:04:01 UTC 2019 x86_64 x86_64
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz: 
              speed         user         nice          sys         idle          irq
       #1  2777 MHz      18029 s          8 s       2263 s      84641 s        309 s
       #2  2772 MHz      17801 s          4 s       2321 s      84719 s        424 s
       #3  2782 MHz      17855 s          3 s       2235 s      85009 s        281 s
       #4  2770 MHz      17865 s         12 s       2427 s      84551 s        414 s
       
  Memory: 7.6603240966796875 GB (5744.65234375 MB free)
  Uptime: 1057.0 sec
  Load Avg:  1.30517578125  1.1337890625  0.6298828125
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

and I ran the above script with

JULIA_NUM_THREADS=4 /usr/bin/julia -O3 test.jl

and

JULIA_NUM_THREADS=1 /usr/bin/julia -O3 -p 4 test.jl

The results were interesting. To summarize:

  • The thread-parallel map-like function did improve as the thread count increased, but not by a huge amount. Going from one thread to four turned 870ms to 530ms, and the transducers tcollect clocked in at 520ms.
  • The worker-parallel code really did not scale particularly well. Going from three workers to four made both the pmap and dcollect things slower. pmap for three workers clocked in at about 660ms. So there was improvement. For dcollect the timing was if anything very slightly worse than the serial version (unless somebody suggests that I’m doing something wrong, maybe it makes sense for me to file an issue about that).

So my questions:

  • Is that speedup what can reasonably be expected? I didn’t expect scaling exactly linear in threads or workers, but on the other hand going from one thread to four not even halving the time is slightly discouraging for a task that seems perfectly parallelizable.
  • Playing with the batch_size argument in pmap was a bit of a disaster. When I set it to div(length(matrices), nworkers()), my computer became very unhappy and unresponsive. Even after killing the process, it was so sluggish that I had to reboot. Judging from htop before it got very sluggish, this is because Julia asked for almost all the computer’s memory. Is there a sensible default choice for this variable?
  • Is there a mindset that is better for writing parallel code than “make collections of things and pass over them”? I would think that something like this is really the best candidate for parallelization. But maybe I’m incorrect—maybe passing arrays around in threads/workers is an issue or something, and the better mindset is to try and distribute the lists ahead of time, do all the work, and then only collect them to one worker/thread after the heavy lifting is done.
  • Does anybody have examples of code where parallelizing in one of the two ways above worked in the way one might naively expect? Like, where adding a second worker/thread halved the computation time?

In advance, I know v1.3 includes some very interesting new functionality. But it isn’t in Fedora’s official repos, and judging from the copr repo I gather that there are some issues building it for red hat linux. So for the moment I’m not making the upgrade.

I’m very interested to hear people’s thoughts. Thank you in advance for your reading and consideration.

I can see how it’s not always practical to build from source and that you’d rather stay with what’s already packaged. But in this specific instance, whatever you’re going to learn from your investigations and this thread is outdated from the start, no? And 1.3 probably implements exactly the things that are going to come out as ‘best practices’ or ‘desired features’ of your investigation.

I’ve personally had good experiences with building Julia locally – are you sure you don’t want to give that a try if parallel Julia is important to you?

I didn’t have the impression that the parallel implementation of something like passing over a list would necessarily change with 1.3. Is that inaccurate? Looking at the docs, Threads.@threads is still there. Are the internals of that meaningfully different than in 1.2? Maybe I have misunderstood the scope of the thread-parallel development in 1.3.

I don’t think this particular setup is desirable for Distributed.jl. You are sending a bunch of arrays to workers so there is a significant serialization cost. When using Distributed.jl-based code, it’s better to try to organize code to minimize the serialization cost. For example, if you want to process data in many files, don’t load it in the main process but rather map over the list of files and load each file in the workers. Also, ideally, try to minimize the data you send back to the main process.

In the threading example, maybe it’s conflicting BLAS’s multi-threading? See also:

IIRC you couldn’t use @threads inside @threads before 1.3 so the functions using threads are not composable. 1.3 solves it. Also, the BLAS issue will be fixed at some point. In that case, you’d really want the latest Julia. I think “always use the latest release” is a good approach when it comes to multi-threading in Julia (or maybe actually for anything).

PS: thanks for benchmarking Transducers :+1:

1 Like

Interesting, thank you for the issue references. I have set export OPENBLAS_NUM_THREADS=1 in my bashrc, which I think Julia does see. So that shouldn’t be the issue. But from what you’re suggesting it sounds like I’ve underestimated the communication cost for multiple workers.

Is there any rule of thumb to be had with regard to the number of threads/workers and stuff? For example, my computer has a dual-core CPU and each core seems to have two threads (?). So I might naively think that JULIA_NUM_THREADS=2 julia -p 2 would be a reasonable default instance if I wanted to give Julia 100% of my compute resources. But that isn’t as efficient as NUM_THREADS=4 with only one worker.

Also, with regard to Transducers: thank you for writing it! Does it make sense to file a bug/issue or something for dcollect being significantly slower than pmap? Or is that expected behavior?

The threading scheduler changed in 1.3. Check out the news, it is called PARTR.
That is a big change, which will allow threading to eventually be robust with respect
to the various levels of parallelism (user versus library). So, I will definitely upgrade if I were you.

2 Likes

I probably should have said “threading support” rather than “multi-threading.” IIUC, OpenBLAS has memory management guarded by a lock, which can become a problem if you are using it from multiple Julia thread https://github.com/JuliaLang/julia/issues/34102#issuecomment-565864782. But I agree that’s a useful workaround if you are using it with Distributed.jl.

Your CPU probably has hyper-threading. If your program is compute-bound, it’s better to set JULIA_NUM_THREADS to number of physical cores. Check it with tools like lscpu.

I don’t think mixing process-based and thread-based parallelism is a good approach unless you use Distributed.jl with multiple machines.

Yes, it may be possible to improve dcollect so bug reports (ideally with MWE) are appreciated. But I’d say it’s expected given a long history of Distributed.jl.

2 Likes

That is very helpful—thank you @tkf! And @PetrKryslUCSD, I will definitely look into that. Looks like it’s time to build v1.3 myself and do some exploring.

Thanks to everybody for commenting. I think I have an at least slightly better gestalt about parallel coding now.

Why not just download the generic Linux binary from https://julialang.org/downloads/?

3 Likes