Julia Threads vs BLAS threads

I have been experimenting with Julia’s (experimental) multithreading feature recently and like the results so far. One of the problems I need to deal with involves the multiplication of several pairs (say on the order of 5 to 50) of matrices of moderate size (say linear size on the order of 10 to 1000). For that problem there is a competition between using Julia threads (to loop over the different pairs) and using the multithreaded matrix multiplication provided by BLAS, and which of the two is most advantageous depends on the specific case (number of pairs and size of the matrices involved).

However, it seems that using BLAS.set_num_threads(n) is

  • very inflexible (e.g. how to obtain the current or default number of threads?)
  • very slow (on the order of 350 microseconds, which is much more than the time required to e.g. multiply two 100 × 100 matrices).
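
(For scale: a 100 × 100 matrix multiplication costs roughly 2·100³ = 2×10⁶ floating-point operations, which a single modern core finishes in tens of microseconds.)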

So it’s not a feasible solution to just modify the number of BLAS threads depending on a quick analysis of the specific case, as that operation itself would dominate the runtime.

As an alternative, I started experimenting with calling BLAS.set_num_threads(1) at the beginning of my script/module, and then using my own multithreaded matrix multiplication:

using LinearAlgebra: mul!  # mul! lives in LinearAlgebra on Julia 0.7+

function mymul!(C, A, B)
    (m, n) = size(C)
    mhalf = m >> 1
    nhalf = n >> 1
    # split C into four quadrants; quadrant i is the product of the
    # matching row block of A with the matching column block of B
    mrange = (1:mhalf, 1:mhalf, (mhalf + 1):m, (mhalf + 1):m)
    nrange = (1:nhalf, (nhalf + 1):n, 1:nhalf, (nhalf + 1):n)
    Threads.@threads for i = 1:4
        mul!(view(C, mrange[i], nrange[i]), view(A, mrange[i], :), view(B, :, nrange[i]))
    end
    return C
end

(here specifically for square matrices and 4 threads, but a slightly more generic strategy can easily be written; see the sketch below)
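For example, such a strategy could split the columns of C into one block per available Julia thread. A minimal sketch (mymul_generic! is a hypothetical name, not part of the original code):

using LinearAlgebra: mul!

function mymul_generic!(C, A, B)
    n = size(C, 2)
    # use at most one block per column
    nt = min(Threads.nthreads(), n)
    Threads.@threads for k = 1:nt
        # roughly equal column chunks: C[:, cols] = A * B[:, cols]
        cols = (1 + div((k - 1) * n, nt)):div(k * n, nt)
        mul!(view(C, :, cols), A, view(B, :, cols))
    end
    return C
end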

This seems to work surprisingly well, i.e. there is no noticeable difference with the multithreading provided by BLAS. But the advantage is that, if mymul! is called from within a threaded loop, it will itself automatically run single-threaded.
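That behavior comes from the current static scheduler: a Threads.@threads loop started from inside another threaded region runs serially on the calling thread. A small sketch to illustrate (nested_threadids is a hypothetical name):

function nested_threadids()
    ids = zeros(Int, Threads.nthreads(), 4)
    Threads.@threads for i = 1:Threads.nthreads()
        Threads.@threads for j = 1:4
            # with the static scheduler, the inner loop runs on the
            # same thread as outer iteration i
            ids[i, j] = Threads.threadid()
        end
    end
    return ids  # each row should hold a single repeated thread id
end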

So my question is whether this is something that will need to be considered in Julia Base / LinearAlgebra as the multithreading features of Julia become more established, or whether there are alternative solutions?


What do you mean by that? Did you observe a performance decrease when using all cores for Julia threads while BLAS tries to use more threads for the matrix multiplication? I’m wondering because I do this a lot and haven’t experimented with the different combinations of BLAS threads vs Julia threads.

See How do you *get* the number of threads for BLAS operations in Julia? - Stack Overflow
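In short, that answer comes down to a direct ccall into the BLAS library. A minimal sketch, assuming Julia’s bundled 64-bit OpenBLAS (other BLAS vendors export differently named symbols):

# hypothetical helper; relies on the openblas_get_num_threads64_ symbol
# of Julia’s bundled 64-bit OpenBLAS build
get_blas_num_threads() =
    Int(ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()))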

Start Julia with JULIA_NUM_THREADS=4 in the environment (e.g. export JULIA_NUM_THREADS=4).
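You can verify the resulting thread count inside the session:

julia> Threads.nthreads()
4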

julia> Alist=[randn(100,100) for i = 1:20];
julia> Blist=[randn(100,100) for i = 1:20];
julia> Clist=[zeros(100,100) for i = 1:20];
julia> function manymul!(Clist::Vector{<:Matrix},Alist::Vector{<:Matrix},Blist::Vector{<:Matrix})
       for i = 1:length(Clist)
           A_mul_B!(Clist[i],Alist[i],Blist[i])
       end
       return Clist
       end
julia> function manymul_threaded!(Clist::Vector{<:Matrix},Alist::Vector{<:Matrix},Blist::Vector{<:Matrix})
       Threads.@threads for i = 1:length(Clist)
           A_mul_B!(Clist[i],Alist[i],Blist[i])
       end
       return Clist
       end
julia> using BenchmarkTools
julia> @benchmark manymul!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     568.517 μs (0.00% GC)
  median time:      793.869 μs (0.00% GC)
  mean time:        773.689 μs (0.00% GC)
  maximum time:     5.168 ms (0.00% GC)
  --------------
  samples:          6439
  evals/sample:     1
julia> @benchmark manymul_threaded!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     752.751 μs (0.00% GC)
  median time:      859.177 μs (0.00% GC)
  mean time:        918.596 μs (0.00% GC)
  maximum time:     1.594 ms (0.00% GC)
  --------------
  samples:          5424
  evals/sample:     1
julia> BLAS.set_num_threads(1)

julia> @benchmark manymul!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.115 ms (0.00% GC)
  median time:      1.132 ms (0.00% GC)
  mean time:        1.182 ms (0.00% GC)
  maximum time:     2.263 ms (0.00% GC)
  --------------
  samples:          4222
  evals/sample:     1
julia> @benchmark manymul_threaded!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     345.441 μs (0.00% GC)
  median time:      525.534 μs (0.00% GC)
  mean time:        482.553 μs (0.00% GC)
  maximum time:     11.142 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

So for this case: Julia threading over the list (benchmark 4) is favorable to BLAS threading of the individual multiplications (benchmark 1), and combining Julia threading with BLAS threading (benchmark 2, leading to hypothetically 4 × 4 = 16 threads on my quad core) is worse than either of them.


Really interesting timings! This should probably be documented somewhere.

How did you get this? I see (on Linux x86_64, OpenBLAS):

julia> switcheroo()= begin; BLAS.set_num_threads(1); BLAS.set_num_threads(8); end
switcheroo (generic function with 1 method)

julia> @btime switcheroo()
  50.639 μs (0 allocations: 0 bytes)

No idea; different architecture or OS?

julia> using BenchmarkTools

julia> switcheroo()= begin; BLAS.set_num_threads(1); BLAS.set_num_threads(8); end
switcheroo (generic function with 1 method)

julia> @btime switcheroo()
  409.758 μs (0 allocations: 0 bytes)

julia> @benchmark switcheroo()
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     410.664 μs (0.00% GC)
  median time:      426.849 μs (0.00% GC)
  mean time:        439.550 μs (0.00% GC)
  maximum time:     1.027 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

But even 50 or 25 microseconds is too slow.

This is very interesting. Thank you so much for showing these timings; I wasn’t aware of this at all. It should definitely be documented somewhere.

Are you both using OpenBLAS, or is perhaps one of you using Intel MKL?

We do want to move to Julia’s own threading as much as possible in the future, but it will take some time to get there. Composability is precisely the main motivation for doing this.


I was wondering what happens with multithreading w.r.t. 0.7 / 1.0, since it is labeled “experimental” at the moment. Does that mean it is exempt from semver, i.e., a 1.x upgrade can break the current threading API?

We may change the API, but I don’t actually think there will be much need for that. The experimental label is really more because there are some operations (mostly I/O) that will still crash in multithreaded code. There’s been a lot of ongoing work to eliminate all the cases that might crash, but getting 1.0 out the door has somewhat stalled that process. It will continue after 1.0, however. Fortunately, unlike interpreted languages, there’s no fundamental roadblock to full multithreading support, just a bunch of work to be done.


In the test I was using MKL.

It looks like setting the number of threads is taking a long time because the logic that determines the BLAS vendor is not optimized away by the compiler:

julia> @btime BLAS.set_num_threads(3)
  27.839 μs (0 allocations: 0 bytes)

But the following is about 5000× faster:

my_BLAS_set_num_threads(n) =
   ccall((:openblas_set_num_threads64_, Base.libblas_name), Cvoid, (Int32,), n)

julia> @btime my_BLAS_set_num_threads(3)
  4.894 ns (0 allocations: 0 bytes)
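
If one wanted to rely on this, a safer variant would guard the raw ccall with a one-time vendor check. A hedged sketch (fast_set_num_threads is a hypothetical name, assuming Julia’s bundled 64-bit OpenBLAS):

# checked once at load time instead of on every call
const IS_OPENBLAS64 = BLAS.vendor() === :openblas64

function fast_set_num_threads(n)
    if IS_OPENBLAS64
        ccall((:openblas_set_num_threads64_, Base.libblas_name), Cvoid, (Int32,), n)
    else
        BLAS.set_num_threads(n)  # generic fallback with the vendor lookup
    end
end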

I was searching for multithreading in Julia, and finally reached this interesting post.

What I found is that OpenBLAS manages its own thread pool and uses it unless we call BLAS.set_num_threads(1). So when we start with JULIA_NUM_THREADS=4 and call BLAS.set_num_threads(4), it’s not 4 × 4 = 16 threads but 4 + 4 = 8.

When I run the manymul_threaded! benchmark:

top - 00:15:38 up 65 days,  7:55,  6 users,  load average: 5.44, 2.78, 1.43
Threads:   8 total,   7 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 75.3 us, 24.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8004416 total,  3736952 free,  3768084 used,   499380 buff/cache
KiB Swap: 32767868 total, 31780740 free,   987128 used.  3825440 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
24866 alkorang  20   0 4002112 2.950g  15396 R 64.5 38.6   4:52.49 julia
24871 alkorang  20   0 4002112 2.950g  15396 R 60.5 38.6   3:44.93 julia
24872 alkorang  20   0 4002112 2.950g  15396 R 59.0 38.6   3:44.58 julia
24868 alkorang  20   0 4002112 2.950g  15396 R 57.4 38.6   0:50.98 julia
24873 alkorang  20   0 4002112 2.950g  15396 R 55.1 38.6   3:44.77 julia
24870 alkorang  20   0 4002112 2.950g  15396 R 51.6 38.6   0:49.21 julia
24869 alkorang  20   0 4002112 2.950g  15396 R 49.2 38.6   0:51.68 julia
24867 alkorang  20   0 4002112 2.950g  15396 S  0.0 38.6   0:00.00 julia

When I run randn(5000, 5000) * randn(5000, 5000); :

top - 00:16:41 up 65 days,  7:56,  6 users,  load average: 4.49, 3.12, 1.65
Threads:   8 total,   4 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu(s): 99.6 us,  0.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  8004416 total,  3537528 free,  3963780 used,   503108 buff/cache
KiB Swap: 32767868 total, 31780752 free,   987116 used.  3627800 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
24871 alkorang  20   0 4197428 3.137g  16064 R 99.7 41.1   4:17.85 julia
24872 alkorang  20   0 4197428 3.137g  16064 R 99.7 41.1   4:16.56 julia
24866 alkorang  20   0 4197428 3.137g  16064 R 99.3 41.1   5:23.14 julia
24873 alkorang  20   0 4197428 3.137g  16064 R 98.7 41.1   4:17.31 julia
24867 alkorang  20   0 4197428 3.137g  16064 S  0.0 41.1   0:00.00 julia
24868 alkorang  20   0 4197428 3.137g  16064 S  0.0 41.1   1:18.36 julia
24869 alkorang  20   0 4197428 3.137g  16064 S  0.0 41.1   1:06.98 julia
24870 alkorang  20   0 4197428 3.137g  16064 S  0.0 41.1   1:06.37 julia

(This comes from the top command on Linux: top -H -p <pid>.)

This tells us that calling a multithreaded function does not necessarily create new threads; it depends on the implementation.

Similarly, calling BLAS.set_num_threads(2) does not destroy 2 threads of the BLAS thread pool; it just deactivates 2 threads, excluding them from computation.
BLAS.set_num_threads(2); randn(5000, 5000) * randn(5000, 5000); :

top - 00:43:46 up 65 days,  8:23,  6 users,  load average: 1.03, 0.32, 0.54
Threads:   8 total,   2 running,   6 sleeping,   0 stopped,   0 zombie
%Cpu(s): 50.3 us,  0.0 sy,  0.0 ni, 49.6 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem :  8004416 total,  3718088 free,  3781652 used,   504676 buff/cache
KiB Swap: 32767868 total, 31789756 free,   978112 used.  3809092 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
24866 alkorang  20   0 4002112 2.954g  16112 R 99.9 38.7   8:46.08 julia
24871 alkorang  20   0 4002112 2.954g  16112 R 99.9 38.7   7:39.11 julia
24867 alkorang  20   0 4002112 2.954g  16112 S  0.0 38.7   0:00.00 julia
24868 alkorang  20   0 4002112 2.954g  16112 S  0.0 38.7   1:18.36 julia
24869 alkorang  20   0 4002112 2.954g  16112 S  0.0 38.7   1:06.98 julia
24870 alkorang  20   0 4002112 2.954g  16112 S  0.0 38.7   1:06.37 julia
24872 alkorang  20   0 4002112 2.954g  16112 S  0.0 38.7   6:52.61 julia
24873 alkorang  20   0 4002112 2.954g  16112 S  0.0 38.7   6:53.96 julia

This is because OpenBLAS does not use its thread pool at all when we set the number of threads to 1; each multiplication then runs on the calling Julia thread, giving 4 × 1 = 4 threads.


The slow setting of the number of BLAS threads has been fixed in #28109, and maybe in #28150, too.

This somewhat recent talk about shared-memory parallelism in Julia might also interest someone, in particular the part about nested parallelism.