Matrix multiplication is slower when multithreading in Julia

While this is true (at least I think so as well), I guess the interesting part is what happens for BLAS.set_num_threads(N) where N>1. And based on a simple test I just ran, OpenBLAS (default) and MKL behave very differently here!

Test function:

using LinearAlgebra  # for BLAS
using BenchmarkTools # for @btime

function f()
    X = rand(1000, 1000)
    Y = zeros(Threads.nthreads())
    # Each Julia thread performs its own 1000×1000 matrix multiplication.
    Threads.@threads for i in 1:Threads.nthreads()
        Y[i] = sum(X * X)
    end
    return Y
end

MKL:

# 4 Julia threads

julia> BLAS.set_num_threads(1)

julia> @btime f(); # ~400% CPU usage
  32.890 ms (39 allocations: 45.78 MiB)

julia> BLAS.set_num_threads(2)

julia> @btime f(); # ~800% CPU usage
  18.541 ms (39 allocations: 45.78 MiB)

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libmkl_rt.so

OpenBLAS:

# 4 Julia threads

julia> BLAS.set_num_threads(1)

julia> @btime f(); # ~400% CPU usage
  36.527 ms (39 allocations: 45.78 MiB)

julia> BLAS.set_num_threads(2)

julia> @btime f(); # ~240% CPU usage
  90.159 ms (39 allocations: 45.78 MiB)

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.so

For OpenBLAS, BLAS.set_num_threads(N) seems to control the total number of threads that BLAS uses across all Julia threads. So increasing N in BLAS.set_num_threads(N) from N=1 to N=2 might actually lead to the computation running on fewer threads overall (~240% instead of ~400% CPU usage above). IMHO, that’s very counterintuitive.

For MKL, on the other hand, BLAS.set_num_threads(N) appears to be a per-thread setting: each Julia thread gets N BLAS threads, so the computation uses N * (# of Julia threads) threads in total (~800% CPU usage for N=2 and 4 Julia threads above).

From the above I’d conclude the following. Let’s say a machine has C cores (no hyperthreading) and we have J Julia threads. What do we have to choose for N in BLAS.set_num_threads(N) to utilize all cores? For OpenBLAS, the answer seems to be N=C whereas for MKL you’d want to set N=C/J.
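The rule above can be sketched as a small helper. To be clear, this is just my heuristic distilled from the experiments in this post, not anything from the OpenBLAS/MKL docs; the function name is mine, and I dispatch on the library name as reported by BLAS.get_config() (the libname field of the loaded libraries):

```julia
using LinearAlgebra

# Heuristic from the experiments above (an assumption, not documented behavior):
#   MKL:      N is per Julia thread, so pick N ≈ C ÷ J
#   OpenBLAS: N is the global total, so pick N = C
function blas_threads_for_full_utilization(C::Int, J::Int, libname::AbstractString)
    if occursin("mkl", lowercase(libname))
        return max(1, C ÷ J)  # at least one BLAS thread per Julia thread
    else
        return C
    end
end

# In a live session one could feed in the actual values, e.g.
#   lib = BLAS.get_config().loaded_libs[1].libname
#   BLAS.set_num_threads(blas_threads_for_full_utilization(C, Threads.nthreads(), lib))
```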

I could nicely confirm this for MKL: on a C=40 core machine I set J=5 and N=8 and obtained

julia> @btime f(); # ~4000% CPU usage
  8.076 ms (39 allocations: 45.78 MiB) 

Note that BLAS.set_num_threads(C), which strangely enough seems to be the default (why?!?), is understandably suboptimal:

julia> @btime f();
  19.382 ms (39 allocations: 45.78 MiB)

julia> BLAS.get_num_threads()
40

For OpenBLAS, it’s not as clean since N=C=40 only gave me

julia> @btime f(); # ~3000% CPU usage
  17.151 ms (39 allocations: 45.78 MiB)

(Note that going to N=45 didn’t improve the CPU usage.)

I guess the conclusion is: Set BLAS.set_num_threads(1) unless you know what you / your BLAS library is doing!
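If you only want single-threaded BLAS inside a particular threaded region, one way is to save and restore the setting around it. BLAS.get_num_threads and BLAS.set_num_threads are real API; the wrapper itself is just a sketch of mine:

```julia
using LinearAlgebra

# Run f() with BLAS temporarily limited to n threads, then restore
# the previous setting (helper name is mine, not part of any API).
function with_blas_threads(f, n::Int)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(n)
    try
        return f()
    finally
        BLAS.set_num_threads(old)
    end
end

# Usage: with_blas_threads(1) do ... your Threads.@threads loop ... end
```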

PS: Instead of running simple experiments it would of course make sense to read the OpenBLAS / MKL documentation on this matter. But experiments are just more fun :smiley:
