While this is true (at least I think so as well), I guess the interesting part is what happens for BLAS.set_num_threads(N) with N > 1. And based on a simple test I just ran, OpenBLAS (the default) and MKL behave very differently here!
Test function:
function f()
    X = rand(1000, 1000)
    Y = zeros(Threads.nthreads())
    # Each Julia thread runs a BLAS-heavy matrix multiplication
    Threads.@threads for i in 1:Threads.nthreads()
        Y[i] = sum(X * X)
    end
    return Y
end
MKL:
# 4 Julia threads
julia> BLAS.set_num_threads(1)
julia> @btime f(); # ~400% CPU usage
32.890 ms (39 allocations: 45.78 MiB)
julia> BLAS.set_num_threads(2)
julia> @btime f(); # ~800% CPU usage
18.541 ms (39 allocations: 45.78 MiB)
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libmkl_rt.so
OpenBLAS:
# 4 Julia threads
julia> BLAS.set_num_threads(1)
julia> @btime f(); # ~400% CPU usage
36.527 ms (39 allocations: 45.78 MiB)
julia> BLAS.set_num_threads(2)
julia> @btime f(); # ~240% CPU usage
90.159 ms (39 allocations: 45.78 MiB)
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.so
For OpenBLAS, BLAS.set_num_threads(2) seems to control the total number of threads that BLAS uses. So increasing N in BLAS.set_num_threads(N) from N=1 to N=2 might actually lead to the computation running on fewer threads overall. IMHO, that's very counterintuitive.
For MKL, on the other hand, BLAS.set_num_threads(2) makes the BLAS computation use 2 * (# of Julia threads) threads in total.
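To make the two interpretations concrete, here is the arithmetic for the runs above with 4 Julia threads (illustrative only; the numbers match the observed CPU usage):

J = 4                      # Julia threads
N = 2                      # BLAS.set_num_threads(N)
openblas_threads = N       # OpenBLAS: N is the total number of BLAS threads -> 2
mkl_threads      = N * J   # MKL: N BLAS threads per calling Julia thread -> 8 (~800% CPU)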
From the above I'd conclude the following. Let's say a machine has C cores (no hyperthreading) and we run J Julia threads. What do we have to choose for N in BLAS.set_num_threads(N) to utilize all cores? For OpenBLAS, the answer seems to be N = C, whereas for MKL you'd want to set N = C/J.
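A minimal sketch of that rule as a helper function (set_blas_threads_for_cores is a hypothetical name; it assumes MKL can be detected from the printed BLAS.get_config() output and that C defaults to Sys.CPU_THREADS):

using LinearAlgebra

# Hypothetical helper encoding the rule above: N = C for OpenBLAS, N = C ÷ J for MKL.
function set_blas_threads_for_cores(C::Int = Sys.CPU_THREADS)
    J = Threads.nthreads()
    mkl_loaded = occursin("mkl", lowercase(string(BLAS.get_config())))
    N = mkl_loaded ? max(1, C ÷ J) : C
    BLAS.set_num_threads(N)
    return N
end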
I could nicely confirm this for MKL: on a machine with C = 40 cores, I set J = 5 and N = 8 (so that N * J = 40) and obtained
julia> @btime f(); # ~4000% CPU usage
8.076 ms (39 allocations: 45.78 MiB)
Note that BLAS.set_num_threads(C), which strangely enough seems to be the default (why?!?), is understandably suboptimal:
julia> @btime f();
19.382 ms (39 allocations: 45.78 MiB)
julia> BLAS.get_num_threads()
40
For OpenBLAS, it's not as clean, since N = C = 40 only gave me
julia> @btime f(); # ~3000% CPU usage
17.151 ms (39 allocations: 45.78 MiB)
(Note that going to N=45 didn't improve the CPU usage.)
I guess the conclusion is: set BLAS.set_num_threads(1) unless you know what you / your BLAS library is doing!
PS: Instead of running simple experiments, it would of course make sense to read the OpenBLAS / MKL documentation on this matter. But experiments are just more fun.