Julia Threads vs BLAS threads


#1

I have been experimenting with Julia’s (experimental) multithreading feature recently and like the results so far. One of the problems I need to deal with involves the multiplication of several pairs (say on the order of 5 to 50) of matrices of moderate size (say linear size on the order of 10 to 1000). For that problem, there is a competition between using Julia threads (to loop over the different pairs) and using the multithreaded matrix multiplication provided by BLAS; which approach is most advantageous depends on the specific case (number of pairs and size of the matrices involved).

However, it seems that using BLAS.set_num_threads(n) is

  • very inflexible (e.g. how to obtain the current or default number of threads?)
  • very slow (on the order of 350 microseconds, which is much more than the time required to, e.g., multiply two 100 x 100 matrices).

So it is not feasible to simply adjust the number of BLAS threads based on a quick analysis of the specific case, as that operation by itself would dominate the runtime.

As an alternative, I started experimenting with calling BLAS.set_num_threads(1) at the beginning of my script/module and then using my own multithreaded matrix multiplication:

using LinearAlgebra  # provides mul! on Julia >= 0.7

function mymul!(C, A, B)
    (m, n) = size(C)
    mhalf = m >> 1
    nhalf = n >> 1
    # Split C into four quadrants; thread i computes the quadrant
    # C[mrange[i], nrange[i]] = A[mrange[i], :] * B[:, nrange[i]]
    mrange = (1:mhalf, 1:mhalf, (mhalf+1):m, (mhalf+1):m)
    nrange = (1:nhalf, (nhalf+1):n, 1:nhalf, (nhalf+1):n)
    Threads.@threads for i = 1:4
        mul!(view(C, mrange[i], nrange[i]), view(A, mrange[i], :), view(B, :, nrange[i]))
    end
    return C
end

(here written specifically for square matrices and 4 threads, but a slightly more generic strategy is easy to write; see the sketch below)
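For example, here is a sketch of one such generalization (illustrative and only lightly tested): it partitions only the rows of C into one contiguous block per thread, so it also handles non-square matrices and arbitrary thread counts. The name mymul_generic! is just for illustration.

function mymul_generic!(C, A, B)
    m = size(C, 1)
    nt = Threads.nthreads()
    # One contiguous row block per thread; the blocks tile 1:m exactly
    rowblocks = [(div((k - 1) * m, nt) + 1):div(k * m, nt) for k = 1:nt]
    Threads.@threads for r in rowblocks
        # Each thread computes C[r, :] = A[r, :] * B
        mul!(view(C, r, :), view(A, r, :), B)
    end
    return C
end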

This seems to work surprisingly well, i.e. there is no noticeable difference from the multithreading provided by BLAS. But the advantage is that, if mymul! is called from within a threaded loop, it will itself automatically run single-threaded.

So my question is whether this is something that will need to be considered in Julia Base / LinearAlgebra as the multithreading features of Julia become more established, or whether there are alternative solutions?


#2

What do you mean by that? Did you observe a performance decrease when you use all cores for Julia threads while BLAS tries to use more threads for the matrix multiplication? I’m wondering because I do this a lot and haven’t experimented with the different combinations of BLAS threads vs Julia threads.


#3

See https://stackoverflow.com/questions/37501181/how-do-you-get-the-number-of-threads-for-blas-operations-in-julia
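In short, there is no exported getter; the workaround discussed there queries OpenBLAS directly, roughly as follows. This is a sketch that assumes Julia is linked against the default 64-bit-integer OpenBLAS build (hence the 64_ suffix), so it will not work with MKL or a 32-bit-integer build:

# Hypothetical helper: ask OpenBLAS itself for its current thread count
blas_num_threads() =
    ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())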


#4

Start Julia with JULIA_NUM_THREADS=4 set in the environment (e.g. export JULIA_NUM_THREADS=4):

julia> Alist=[randn(100,100) for i = 1:20];
julia> Blist=[randn(100,100) for i = 1:20];
julia> Clist=[zeros(100,100) for i = 1:20];
julia> function manymul!(Clist::Vector{<:Matrix},Alist::Vector{<:Matrix},Blist::Vector{<:Matrix})
       for i = 1:length(Clist)
           A_mul_B!(Clist[i],Alist[i],Blist[i])
       end
       return Clist
       end
julia> function manymul_threaded!(Clist::Vector{<:Matrix},Alist::Vector{<:Matrix},Blist::Vector{<:Matrix})
       Threads.@threads for i = 1:length(Clist)
           A_mul_B!(Clist[i],Alist[i],Blist[i])
       end
       return Clist
       end
julia> using BenchmarkTools
julia> @benchmark manymul!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     568.517 μs (0.00% GC)
  median time:      793.869 μs (0.00% GC)
  mean time:        773.689 μs (0.00% GC)
  maximum time:     5.168 ms (0.00% GC)
  --------------
  samples:          6439
  evals/sample:     1
julia> @benchmark manymul_threaded!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     752.751 μs (0.00% GC)
  median time:      859.177 μs (0.00% GC)
  mean time:        918.596 μs (0.00% GC)
  maximum time:     1.594 ms (0.00% GC)
  --------------
  samples:          5424
  evals/sample:     1
julia> BLAS.set_num_threads(1)

julia> @benchmark manymul!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.115 ms (0.00% GC)
  median time:      1.132 ms (0.00% GC)
  mean time:        1.182 ms (0.00% GC)
  maximum time:     2.263 ms (0.00% GC)
  --------------
  samples:          4222
  evals/sample:     1
julia> @benchmark manymul_threaded!($Clist,$Alist,$Blist)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     345.441 μs (0.00% GC)
  median time:      525.534 μs (0.00% GC)
  mean time:        482.553 μs (0.00% GC)
  maximum time:     11.142 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

So for this case, Julia threading over the list (benchmark 4) is preferable to BLAS threading of the individual multiplications (benchmark 1), and combining Julia threading with BLAS threading (benchmark 2, leading to hypothetically 4 x 4 = 16 threads on my quad core) is worse than either benchmark 4 or benchmark 1.
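Given those numbers, one could imagine picking a strategy per call. The following is a purely illustrative sketch (the threshold is not benchmark-backed): it assumes BLAS has been pinned to a single thread once at startup, and reuses manymul_threaded! from above together with mymul! from the first post for the few-pairs regime.

function manymul_auto!(Clist, Alist, Blist)
    if length(Clist) >= Threads.nthreads()
        # Enough pairs to keep all threads busy: thread over the list,
        # with serial BLAS inside each multiplication
        manymul_threaded!(Clist, Alist, Blist)
    else
        # Few pairs: run the list serially, threading inside each product
        for i = 1:length(Clist)
            mymul!(Clist[i], Alist[i], Blist[i])
        end
    end
    return Clist
end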


#5

Really interesting timings! This should probably be documented somewhere.


#6

How did you get that number? I see the following (on Linux x86_64, OpenBLAS):

julia> switcheroo()= begin; BLAS.set_num_threads(1); BLAS.set_num_threads(8); end
switcheroo (generic function with 1 method)

julia> @btime switcheroo()
  50.639 μs (0 allocations: 0 bytes)


#7

No idea; different architecture or OS, perhaps?

julia> using BenchmarkTools

julia> switcheroo()= begin; BLAS.set_num_threads(1); BLAS.set_num_threads(8); end
switcheroo (generic function with 1 method)

julia> @btime switcheroo()
  409.758 μs (0 allocations: 0 bytes)

julia> @benchmark switcheroo()
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     410.664 μs (0.00% GC)
  median time:      426.849 μs (0.00% GC)
  mean time:        439.550 μs (0.00% GC)
  maximum time:     1.027 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

But even 50 microseconds, or 25, is too slow.


#8

This is very interesting. Thank you so much for showing these timings; I wasn’t aware of this at all. This should definitely be documented somewhere.


#9

Are you both using OpenBLAS, or is one of you perhaps using Intel MKL?


#10

We do want to move to Julia’s own threading as much as possible in the future, but it will take some time to get there. Composability is precisely the main motivation for doing this.


#11

I was wondering what happens with multithreading w.r.t. 0.7 / 1.0, since it is labeled “experimental” at the moment. Does that mean it is exempt from semver, i.e., a 1.x upgrade can break the current threading API?


#12

We may change the API, but I don’t actually think there will be much need for that. The experimental label is really more because there are some operations (mostly I/O) that will still crash in multithreaded code. There’s been a lot of ongoing work to eliminate all the cases that might crash, but getting 1.0 out the door has somewhat stalled that process. It will continue after 1.0, however. Fortunately, unlike interpreted languages, there’s no fundamental roadblock to full multithreading support, just a bunch of work to be done.


#13

In the test I was using MKL.


#14

It looks like setting the number of threads takes a long time because the logic that determines the BLAS vendor is not optimized away by the compiler:

julia> @btime BLAS.set_num_threads(3)
  27.839 μs (0 allocations: 0 bytes)

But the following is roughly 5000x faster:

my_BLAS_set_num_threads(n) =
   ccall((:openblas_set_num_threads64_, Base.libblas_name), Cvoid, (Int32,), n)

julia> @btime my_BLAS_set_num_threads(3)
  4.894 ns (0 allocations: 0 bytes)
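With a setter this cheap, it would even become feasible to toggle the BLAS thread count around a Julia-threaded section. An illustrative sketch, reusing the lists from above (the 8 stands in for the machine’s preferred default, and mul! requires using LinearAlgebra on Julia >= 0.7):

my_BLAS_set_num_threads(1)   # serial BLAS inside the threaded loop
Threads.@threads for i = 1:length(Clist)
    mul!(Clist[i], Alist[i], Blist[i])
end
my_BLAS_set_num_threads(8)   # restore multithreaded BLAS afterwards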