Julia Threads vs BLAS threads


#1

I have been experimenting with Julia’s (experimental) multithreading feature recently and like the results so far. One of the problems I need to deal with involves the multiplication of several pairs (say on the order of 5 to 50) of matrices of moderate size (say linear size on the order of 10 to 1000). For that problem, there is a competition between using Julia threads (to loop over the different pairs) and using the multithreaded matrix multiplication provided by BLAS; which approach is most advantageous depends on the specific case (number of pairs and size of the matrices involved).

However, it seems that using BLAS.set_num_threads(n) is

  • very inflexible (e.g. how to obtain the current or default number of threads?)
  • very slow (on the order of 350 microseconds, which is much more than the time required to, e.g., multiply two 100 x 100 matrices).

So it is not feasible to simply adjust the number of BLAS threads based on a quick analysis of the specific case, as that operation by itself would dominate the runtime.

As an alternative, I started experimenting with calling BLAS.set_num_threads(1) at the beginning of my script/module and then using my own multithreaded matrix multiplication:

using LinearAlgebra  # provides mul! on Julia >= 0.7

function mymul!(C, A, B)
    (m, n) = size(C)
    mhalf = m >> 1
    nhalf = n >> 1
    # Split C into four quadrants; thread i computes the quadrant
    # C[mrange[i], nrange[i]] = A[mrange[i], :] * B[:, nrange[i]]
    mrange = (1:mhalf, 1:mhalf, (mhalf+1):m, (mhalf+1):m)
    nrange = (1:nhalf, (nhalf+1):n, 1:nhalf, (nhalf+1):n)
    Threads.@threads for i = 1:4
        mul!(view(C, mrange[i], nrange[i]), view(A, mrange[i], :), view(B, :, nrange[i]))
    end
    return C
end

(here written specifically for square matrices and 4 threads, but a slightly more generic strategy is easy to write; see the sketch below)
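For example, here is a sketch of one such generalization (illustrative and only lightly tested): it partitions only the rows of C into one contiguous block per thread, so it also handles non-square matrices and arbitrary thread counts. The name mymul_generic! is just for illustration.

function mymul_generic!(C, A, B)
    m = size(C, 1)
    nt = Threads.nthreads()
    # One contiguous row block per thread; the blocks tile 1:m exactly
    rowblocks = [(div((k - 1) * m, nt) + 1):div(k * m, nt) for k = 1:nt]
    Threads.@threads for r in rowblocks
        # Each thread computes C[r, :] = A[r, :] * B
        mul!(view(C, r, :), view(A, r, :), B)
    end
    return C
end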

This seems to work surprisingly well, i.e. there is no noticeable difference from the multithreading provided by BLAS. But the advantage is that, if mymul! is called from within a threaded loop, it will itself automatically run single-threaded.

So my question is whether this is something that will need to be considered in Julia Base / LinearAlgebra as the multithreading features of Julia become more established, or whether there are alternative solutions?


#2

What do you mean by that? Did you observe a performance decrease when you use all cores for Julia threads while BLAS tries to use more threads for the matrix multiplication? I’m wondering because I do this a lot and haven’t experimented with the different combinations of BLAS threads vs Julia threads.


#3

See https://stackoverflow.com/questions/37501181/how-do-you-get-the-number-of-threads-for-blas-operations-in-julia
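In short, there is no exported getter; the workaround discussed there queries OpenBLAS directly, roughly as follows. This is a sketch that assumes Julia is linked against the default 64-bit-integer OpenBLAS build (hence the 64_ suffix), so it will not work with MKL or a 32-bit-integer build:

# Hypothetical helper: ask OpenBLAS itself for its current thread count
blas_num_threads() =
    ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())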


#4

Start Julia with JULIA_NUM_THREADS=4 set in the environment (e.g. export JULIA_NUM_THREADS=4):

julia> Alist=[randn(100,100) for i = 1:20];
julia> Blist=[randn(100,100) for i = 1:20];
julia> Clist=[zeros(100,100) for i = 1:20];
julia> function manymul!(Clist::Vector{<:Matrix},Alist::Vector{<:Matrix},Blist::Vector{<:Matrix})
       for i = 1:length(Clist)
           A_mul_B!(Clist[i],Alist[i],Blist[i])
       end
       return Clist
       end
julia> function manymul_threaded!(Clist::Vector{<:Matrix},Alist::Vector{<:Matrix},Blist::Vector{<:Matrix})
       Threads.@threads for i = 1:length(Clist)
           A_mul_B!(Clist[i],Alist[i],Blist[i])
       end
       return Clist
       end
julia> using BenchmarkTools
julia> @benchmark manymul!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     568.517 μs (0.00% GC)
  median time:      793.869 μs (0.00% GC)
  mean time:        773.689 μs (0.00% GC)
  maximum time:     5.168 ms (0.00% GC)
  --------------
  samples:          6439
  evals/sample:     1
julia> @benchmark manymul_threaded!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     752.751 μs (0.00% GC)
  median time:      859.177 μs (0.00% GC)
  mean time:        918.596 μs (0.00% GC)
  maximum time:     1.594 ms (0.00% GC)
  --------------
  samples:          5424
  evals/sample:     1
julia> BLAS.set_num_threads(1)

julia> @benchmark manymul!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.115 ms (0.00% GC)
  median time:      1.132 ms (0.00% GC)
  mean time:        1.182 ms (0.00% GC)
  maximum time:     2.263 ms (0.00% GC)
  --------------
  samples:          4222
  evals/sample:     1
julia> @benchmark manymul_threaded!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     345.441 μs (0.00% GC)
  median time:      525.534 μs (0.00% GC)
  mean time:        482.553 μs (0.00% GC)
  maximum time:     11.142 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

So for this case, Julia threading over the list (benchmark 4) is preferable to BLAS threading of the individual multiplications (benchmark 1), and combining Julia threading with BLAS threading (benchmark 2, leading to hypothetically 4 x 4 = 16 threads on my quad core) is worse than either benchmark 4 or benchmark 1.
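Given those numbers, one could imagine picking a strategy per call. The following is a purely illustrative sketch (the threshold is not benchmark-backed): it assumes BLAS has been pinned to a single thread once at startup, and reuses manymul_threaded! from above together with mymul! from the first post for the few-pairs regime.

function manymul_auto!(Clist, Alist, Blist)
    if length(Clist) >= Threads.nthreads()
        # Enough pairs to keep all threads busy: thread over the list,
        # with serial BLAS inside each multiplication
        manymul_threaded!(Clist, Alist, Blist)
    else
        # Few pairs: run the list serially, threading inside each product
        for i = 1:length(Clist)
            mymul!(Clist[i], Alist[i], Blist[i])
        end
    end
    return Clist
end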


#5

Really interesting timings! This should probably be documented somewhere.


#6

How did you get that number? I see the following (on Linux x86_64, OpenBLAS):

julia> switcheroo()= begin; BLAS.set_num_threads(1); BLAS.set_num_threads(8); end
switcheroo (generic function with 1 method)

julia> @btime switcheroo()
  50.639 μs (0 allocations: 0 bytes)


#7

No idea; different architecture or OS, perhaps?

julia> using BenchmarkTools

julia> switcheroo()= begin; BLAS.set_num_threads(1); BLAS.set_num_threads(8); end
switcheroo (generic function with 1 method)

julia> @btime switcheroo()
  409.758 μs (0 allocations: 0 bytes)

julia> @benchmark switcheroo()
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     410.664 μs (0.00% GC)
  median time:      426.849 μs (0.00% GC)
  mean time:        439.550 μs (0.00% GC)
  maximum time:     1.027 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

But even 50 microseconds, or 25, is too slow.


#8

This is very interesting. Thank you so much for showing these timings; I wasn’t aware of this at all. This should definitely be documented somewhere.


#9

Are you both using OpenBLAS, or is one of you perhaps using Intel MKL?


#10

We do want to move to Julia’s own threading as much as possible in the future, but it will take some time to get there. Composability is precisely the main motivation for doing this.


#11

I was wondering what happens with multithreading w.r.t. 0.7 / 1.0, since it is labeled “experimental” at the moment. Does that mean it is exempt from semver, i.e., a 1.x upgrade can break the current threading API?


#12

We may change the API, but I don’t actually think there will be much need for that. The experimental label is really more because there are some operations (mostly I/O) that will still crash in multithreaded code. There’s been a lot of ongoing work to eliminate all the cases that might crash, but getting 1.0 out the door has somewhat stalled that process. It will continue after 1.0, however. Fortunately, unlike interpreted languages, there’s no fundamental roadblock to full multithreading support, just a bunch of work to be done.


#13

In the test I was using MKL.


#14

It looks like setting the number of threads takes a long time because the logic that determines the BLAS vendor is not optimized away by the compiler:

julia> @btime BLAS.set_num_threads(3)
  27.839 μs (0 allocations: 0 bytes)

But the following is roughly 5000x faster:

my_BLAS_set_num_threads(n) =
   ccall((:openblas_set_num_threads64_, Base.libblas_name), Cvoid, (Int32,), n)

julia> @btime my_BLAS_set_num_threads(3)
  4.894 ns (0 allocations: 0 bytes)
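With a setter this cheap, it would even become feasible to toggle the BLAS thread count around a Julia-threaded section. An illustrative sketch, reusing the lists from above (the 8 stands in for the machine’s preferred default, and mul! requires using LinearAlgebra on Julia >= 0.7):

my_BLAS_set_num_threads(1)   # serial BLAS inside the threaded loop
Threads.@threads for i = 1:length(Clist)
    mul!(Clist[i], Alist[i], Blist[i])
end
my_BLAS_set_num_threads(8)   # restore multithreaded BLAS afterwards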