Writing effective parallel code

cgeoga · December 17, 2019, 8:44pm

I’m trying to think more seriously about writing Julia code that is either thread-parallel or worker-parallel, to make better use of machines with many threads or individual cluster nodes with many CPUs. I recognize that in general it is hard to give advice about the best way to write parallel code, and that the first and primary answer is always to benchmark the options in individual applications. But I’m hoping to get some kind of gestalt about what parallelizes well, and clearly my sensibilities about this are currently weak.

I’ve written some test code here of a problem that I would think parallelizes quite well: pass over a collection of matrices and factorize each of them. This little script tests the thread parallel (JULIA_NUM_THREADS=XYZ julia...) and the worker-parallel (julia -p XYZ ...) timings. I also threw in the timings from Transducers.jl out of curiosity and because I really like that package.


using Distributed, Transducers, BenchmarkTools
@everywhere using LinearAlgebra

@everywhere term_apply(X) = factorize(X'X)

@everywhere function tmap(fun, VX)
  funX1  = fun(VX[1])
  out    = Vector{typeof(funX1)}(undef, length(VX))
  out[1] = funX1
  Threads.@threads for j in 2:length(out)
    @inbounds out[j] = fun(VX[j])
  end
  return out
end

nmat     = 500
matrices = [randn(512, 512) for _ in 1:nmat]

println("Serial time:")
@btime map(term_apply, $matrices)

if Threads.nthreads() > 1 && nworkers() == 1

  println("Thread-parallel time:")
  @btime tmap(term_apply, $matrices)

  println("Transducer thread-parallel time:")
  @btime tcollect(Map(term_apply), $matrices)

elseif nworkers() > 1 && Threads.nthreads() == 1

  println("Pmap parallel time:") # much worse with bigger batch size.
  @btime pmap(term_apply, $matrices)

  println("Transducer worker-parallel time:")
  @btime dcollect(Map(term_apply), $matrices)

else
  error("Please use multiple threads OR workers.")  # error() already throws
end

I set the environment variables OMP_NUM_THREADS and OPENBLAS_NUM_THREADS to 1, and my exact Julia installation is

Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  uname: Linux 5.3.15-300.fc31.x86_64 #1 SMP Thu Dec 5 15:04:01 UTC 2019 x86_64 x86_64
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz: 
              speed         user         nice          sys         idle          irq
       #1  2777 MHz      18029 s          8 s       2263 s      84641 s        309 s
       #2  2772 MHz      17801 s          4 s       2321 s      84719 s        424 s
       #3  2782 MHz      17855 s          3 s       2235 s      85009 s        281 s
       #4  2770 MHz      17865 s         12 s       2427 s      84551 s        414 s
       
  Memory: 7.6603240966796875 GB (5744.65234375 MB free)
  Uptime: 1057.0 sec
  Load Avg:  1.30517578125  1.1337890625  0.6298828125
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

and I ran the above script with

JULIA_NUM_THREADS=4 /usr/bin/julia -O3 test.jl

and

JULIA_NUM_THREADS=1 /usr/bin/julia -O3 -p 4 test.jl

The results were interesting. To summarize:

The thread-parallel map-like function did improve as the thread count increased, but not by a huge amount. Going from one thread to four took the time from 870ms to 530ms, and the Transducers tcollect clocked in at 520ms.
The worker-parallel code really did not scale particularly well. Going from three workers to four made both pmap and dcollect slower. pmap with three workers clocked in at about 660ms, so there was some improvement over serial. For dcollect the timing was, if anything, very slightly worse than the serial version (unless somebody suggests that I’m doing something wrong, maybe it makes sense for me to file an issue about that).

So my questions:

Is that speedup what can reasonably be expected? I didn’t expect scaling exactly linear in threads or workers, but on the other hand going from one thread to four not even halving the time is slightly discouraging for a task that seems perfectly parallelizable.
Playing with the batch_size argument in pmap was a bit of a disaster. When I set it to div(length(matrices), nworkers()), my computer became very unhappy and unresponsive. Even after killing the process, it was so sluggish that I had to reboot. Judging from htop before it got very sluggish, this is because Julia asked for almost all the computer’s memory. Is there a sensible default choice for this variable?
Is there a mindset that is better for writing parallel code than “make collections of things and pass over them”? I would think that something like this is really the best candidate for parallelization. But maybe I’m incorrect—maybe passing arrays around in threads/workers is an issue or something, and the better mindset is to try and distribute the lists ahead of time, do all the work, and then only collect them to one worker/thread after the heavy lifting is done.
Does anybody have examples of code where parallelizing in one of the two ways above worked in the way one might naively expect? Like, where adding a second worker/thread halved the computation time?
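
The "distribute the lists ahead of time" idea from the third question could be sketched roughly as below. This is only an illustration under assumptions: chunked_tmap and its nchunks keyword are hypothetical names, not part of Base or Transducers.jl, and it uses Threads.@threads so that it also runs on Julia 1.2.

```julia
# Hypothetical helper: split the indices into contiguous chunks up front,
# let each loop iteration process one whole chunk, and concatenate at the end.
# Assumes xs is a non-empty indexable collection.
function chunked_tmap(f, xs; nchunks = Threads.nthreads())
    chunk_len = cld(length(xs), nchunks)
    chunks    = collect(Iterators.partition(eachindex(xs), chunk_len))
    partial   = Vector{Any}(undef, length(chunks))
    Threads.@threads for c in 1:length(chunks)
        # Each slot of `partial` is written by exactly one iteration.
        partial[c] = map(i -> f(xs[i]), chunks[c])
    end
    return reduce(vcat, partial)   # chunk order is preserved
end

squares = chunked_tmap(abs2, collect(1:10); nchunks = 3)
```

Whether coarse chunks beat one loop iteration per element depends on how uniform the per-element cost is, so this is something to benchmark rather than assume.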

I should say up front that I know v1.3 includes some very interesting new functionality. But it isn’t in Fedora’s official repos, and judging from the copr repo I gather that there are some issues building it for Red Hat Linux. So for the moment I’m not making the upgrade.

I’m very interested to hear people’s thoughts. Thank you in advance for your reading and consideration.

tkluck · December 17, 2019, 8:55pm

I can see how it’s not always practical to build from source and that you’d rather stay with what’s already packaged. But in this specific instance, whatever you’re going to learn from your investigations and this thread is outdated from the start, no? And 1.3 probably implements exactly the things that are going to come out as ‘best practices’ or ‘desired features’ of your investigation.

I’ve personally had good experiences with building Julia locally – are you sure you don’t want to give that a try if parallel Julia is important to you?

cgeoga · December 17, 2019, 9:03pm

I didn’t have the impression that the parallel implementation of something like passing over a list would necessarily change with 1.3. Is that inaccurate? Looking at the docs, Threads.@threads is still there. Are the internals of that meaningfully different than in 1.2? Maybe I have misunderstood the scope of the thread-parallel development in 1.3.

tkf · December 17, 2019, 10:02pm

I don’t think this particular setup is desirable for Distributed.jl. You are sending a bunch of arrays to workers so there is a significant serialization cost. When using Distributed.jl-based code, it’s better to try to organize code to minimize the serialization cost. For example, if you want to process data in many files, don’t load it in the main process but rather map over the list of files and load each file in the workers. Also, ideally, try to minimize the data you send back to the main process.
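
A minimal sketch of that advice, under stated assumptions: load_and_reduce is a hypothetical function in which building a matrix from a seed stands in for "load a file on the worker", and the two local workers are started with addprocs purely for illustration. Only an integer is sent to each worker and only one Float64 comes back.

```julia
using Distributed
addprocs(2)   # assumption: two local worker processes

@everywhere using LinearAlgebra, Random

# Hypothetical stand-in for reading a file on the worker: the matrix is
# created from a seed, so no large array crosses the process boundary.
@everywhere function load_and_reduce(seed::Integer)
    X = randn(MersenneTwister(seed), 128, 64)  # data is created on the worker
    F = cholesky(Symmetric(X'X))               # heavy work stays on the worker
    return logdet(F)                           # send back a single number
end

logdets = pmap(load_and_reduce, 1:8)
```

By contrast, pmap(term_apply, matrices) in the original script serializes every 512×512 matrix to a worker and a full factorization back.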

In the threading example, maybe it’s conflicting with BLAS’s multi-threading? See also:

github.com/JuliaLang/julia
Does the number of threads affect the executed code?
opened 02:45PM - 14 Dec 19 UTC by PetrKryslUCSD · closed 08:25PM - 29 Sep 21 UTC · labels: parallelism, needs more info

The work is proportional to the number of elements. For a mesh of 128000 elements both a serial and 1-thread simulation carry out the computational work in 2.5 seconds. For a mesh of 1024000 elements both a serial and 1-thread simulation carry out the computational work in around 20.0 seconds. So, eight times more work, eight times longer.

Now comes the weird part. When I use 2 threads, so that each thread works on 512000 elements, the amount of work per thread is 10 seconds. However the work procedure shows that it consumes around 16.5 seconds. When I use 4 threads, each thread works on 256,000 elements, and consequently the work procedure should execute in 5 seconds. However, the work procedure actually shows that it consumes roughly 15.6 seconds. With 8 threads, each thread works on 128,000 elements, and the work procedure should only take 2.5 seconds. However, it reports to take roughly 14 seconds.

The threaded execution therefore looks like this:

Number of elements   Number of threads   Execution time per thread (s)
1024000              1                   20
512000               2                   16.5
256000               4                   15.6
128000               8                   14

The weird thing is I time the interior of the work procedure. So that should exclude any overhead associated with threading. However, as you can see the number of threads actually affects how much time the work procedure spends doing the work. The total amount of time farming out the work to the threads is very small. The total amount of time collecting the data with `wait` pretty much is equal to the amount of time reported by the work procedure. As if the overhead related to threading was very small.

The whole thing can be exercised by

```
git clone https://github.com/PetrKryslUCSD/FinEtoolsDeforNonlinear.jl
```

followed by

```
cd FinEtoolsDeforNonlinear.jl
export JULIA_NUM_THREADS=8
julia
```

and

```
include("threaded_test.jl")
```

I'm sorry I don't have a more minimal working example!

github.com/JuliaLang/julia
partr thread support for openblas
opened 07:29PM - 04 Aug 19 UTC by ViralBShah · closed 01:26AM - 30 Jan 22 UTC · labels: linear algebra, multithreading

Here are some notes from digging into the openblas codebase (with @stevengj) to enable partr threading support.

1. [`exec_blas`](https://github.com/xianyi/OpenBLAS/blob/96a794e9fd9fdc2b03a01b3dabd0a10006d0aa98/driver/others/blas_server_omp.c#L308) is called by all the routines. The code pattern followed is setting up the work queue and calling `exec_blas` to do all the work through an [openmp pragma](https://github.com/xianyi/OpenBLAS/blob/96a794e9fd9fdc2b03a01b3dabd0a10006d0aa98/driver/others/blas_server_omp.c#L338).
2. The exception is lapack routines, which also use the `exec_blas_async` functions.
3. The openmp backend doesn’t seem to implement the async and thus I believe that it will not multi-thread the lapack calls.
4. [Windows](https://github.com/xianyi/OpenBLAS/blob/develop/driver/others/blas_server_win32.c) has its own threading backend.

The easiest way may be to modify the openmp threading backend, which seems amenable to something like the [fftw partr backend](https://github.com/JuliaMath/FFTW.jl/pull/105). To start with, we should ignore lapack threading. We could probably just implement an `exec_blas_async` fallback that calls `exec_blas` (and make `exec_blas_async_wait` a no-op).

All of this should work on windows too, although going through the openmp build route may need some work on the makefiles. The [patch to FFTW](https://github.com/JuliaMath/FFTWBuilder/pull/1) should be indicative of something similar to be done for the openblas build.

IIRC you couldn’t use @threads inside @threads before 1.3 so the functions using threads are not composable. 1.3 solves it. Also, the BLAS issue will be fixed at some point. In that case, you’d really want the latest Julia. I think “always use the latest release” is a good approach when it comes to multi-threading in Julia (or maybe actually for anything).
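
To illustrate what composability means here, the sketch below nests one task-based parallel map inside another. It assumes Julia 1.3 or later, where Threads.@spawn is available; spawn_map is a hypothetical helper name, not a Base function.

```julia
# Assumes Julia >= 1.3 (Threads.@spawn). spawn_map is a hypothetical helper.
function spawn_map(f, xs)
    tasks = [Threads.@spawn f(x) for x in xs]  # one task per element
    return map(fetch, tasks)                   # results in input order
end

# Nested use: the inner spawn_map runs inside tasks created by the outer one,
# and the scheduler is free to place all of those tasks on any thread.
nested = spawn_map(row -> sum(spawn_map(abs2, row)), [1:10, 11:20, 21:30])
```

One task per element is only reasonable when each element is expensive, as with a 512×512 factorization; for cheap elements the task overhead would likely dominate.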

PS: thanks for benchmarking Transducers

cgeoga · December 17, 2019, 11:27pm

Interesting, thank you for the issue references. I have set export OPENBLAS_NUM_THREADS=1 in my bashrc, which I think Julia does see. So that shouldn’t be the issue. But from what you’re suggesting it sounds like I’ve underestimated the communication cost for multiple workers.

Is there any rule of thumb to be had with regard to the number of threads/workers and stuff? For example, my computer has a dual-core CPU and each core seems to have two threads (?). So I might naively think that JULIA_NUM_THREADS=2 julia -p 2 would be a reasonable default instance if I wanted to give Julia 100% of my compute resources. But that isn’t as efficient as JULIA_NUM_THREADS=4 with only one worker.

Also, with regard to Transducers: thank you for writing it! Does it make sense to file a bug/issue or something for dcollect being significantly slower than pmap? Or is that expected behavior?

PetrKryslUCSD · December 17, 2019, 11:31pm

The threading scheduler changed in 1.3. Check out the news, it is called PARTR. That is a big change, which will allow threading to eventually be robust with respect to the various levels of parallelism (user versus library). So, I would definitely upgrade if I were you.

tkf · December 17, 2019, 11:53pm

I probably should have said “threading support” rather than “multi-threading.” IIUC, OpenBLAS has memory management guarded by a lock, which can become a problem if you are using it from multiple Julia threads; see “Does the number of threads affect the executed code?” (Issue #34102, JuliaLang/julia, GitHub). But I agree that setting OPENBLAS_NUM_THREADS=1 is a useful workaround if you are using it with Distributed.jl.

Your CPU probably has hyper-threading. If your program is compute-bound, it’s better to set JULIA_NUM_THREADS to the number of physical cores. Check it with tools like lscpu.
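
As a rough check of the timings reported earlier in the thread against the physical core count, here is a small sketch. The value physical_cores = 2 is an assumption taken from the "dual-core" description above; Julia itself only reports logical CPUs through Sys.CPU_THREADS.

```julia
logical_cpus  = Sys.CPU_THREADS     # logical CPUs, hyper-threads included
julia_threads = Threads.nthreads()  # fixed at startup via JULIA_NUM_THREADS

serial_ms, threaded_ms = 870, 530   # timings reported earlier in the thread
physical_cores = 2                  # assumption: dual-core CPU, as described above

speedup    = serial_ms / threaded_ms   # about 1.64
efficiency = speedup / physical_cores  # about 0.82 of the two-core ideal

println("logical CPUs: ", logical_cpus, ", Julia threads: ", julia_threads)
println("speedup: ", round(speedup, digits = 2),
        ", efficiency vs. physical cores: ", round(efficiency, digits = 2))
```

Measured against two physical cores rather than four hardware threads, the reported 870ms to 530ms improvement looks much closer to what a compute-bound task could plausibly achieve.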

I don’t think mixing process-based and thread-based parallelism is a good approach unless you use Distributed.jl with multiple machines.

Yes, it may be possible to improve dcollect, so bug reports (ideally with an MWE) are appreciated. But I’d say it’s expected given the long history of Distributed.jl.

cgeoga · December 18, 2019, 4:08am

That is very helpful—thank you @tkf! And @PetrKryslUCSD, I will definitely look into that. Looks like it’s time to build v1.3 myself and do some exploring.

Thanks to everybody for commenting. I think I have an at least slightly better gestalt about parallel coding now.

tkf · December 18, 2019, 4:38am

Why not just download the generic Linux binary from https://julialang.org/downloads/?
