Hi everyone,
Problem
I am trying to understand how to optimally set the number of BLAS threads when I am using a multithreaded LAPACK function in a Threads.@threads
loop.
More specifically I want to compute the eigendecomposition of a number of independent matrices in parallel as quickly as possible.
System and Setup
My system:
Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.0.0)
CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 4
and I compiled Julia with MKL.
The function I am calling to compute the eigenvalues and eigenvectors in this MWE is syevr
which is multi-threaded in MKL:
function compute_ev_threads(A::Array{Array{Float64, 2}, 1})
Threads.@threads for k = 1:length(A)
W, Z = LinearAlgebra.LAPACK.syevr!('V', 'A', 'U', A[k], 0., 0., 0, 0, -1.0)
end
return nothing
end
Here A
is a list of matrices typically with a dimension in the range of 2 - 1000
. But for a specific problem A
could hold matrices of very different sizes. The length of A
can be up to 1000
.
Tuning knobs
- It seems to me that MKL dynamically chooses the number of threads that it uses for the decomposition, i.e. 1 thread for smaller matrices and more threads for large matrices. That behaviour can be disabled with the environment variable
ENV["MKL_DYNAMIC"] = false
. - I can set the maximum number of threads for BLAS with
BLAS.set_num_threads(nthreads)
. Again ifENV["MKL_DYNAMIC"] = true
might decide to use less than that number. -
JULIA_NUM_THREADS
can be used to tell Julia how many threads I want to use in theThreads.@threads
loop.
Benchmarks
Setup 1: 50 matrices of dimension 200x200
I am setting JULIA_NUM_THREADS = 4
(number of physical cores)
using LinearAlgebra, BenchmarkTools, Random
ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
BLAS.set_num_threads(nthreads);
rng = Random.MersenneTwister(1);
N = 200;
A = [rand(rng, N, N) for k = 1:50];
t = @belapsed compute_ev_threads($A)
println("Number of BLAS threads: $(nthreads), Time: $(t)")
end
Number of BLAS threads: 1, Time: 0.052786899
Number of BLAS threads: 2, Time: 0.058757388
Number of BLAS threads: 4, Time: 0.115723637
Here it seems like 1 BLAS thread works best. Intuitively I would say this makes sense as each Julia thread can create 1 BLAS thread and work in parallel. If each Julia thread would try to use 4 BLAS threads the CPU threads are oversubscribed and the performance goes down.
Question 1: Does this reasoning make sense? Is BLAS.set_num_threads()
setting the global number of BLAS threads that can be used at any one time or per Julia thread? The discussion in Julia Threads vs BLAS Threads seems to say that the setting sets the global thread pool for BLAS. However, this seems to contradict above results.
Setting ENV["MKL_DYNAMIC"] = true
doestn’t seem to have any impact. I get:
Number of BLAS threads: 1, Time: 0.05300575
Number of BLAS threads: 2, Time: 0.059146294
Number of BLAS threads: 4, Time: 0.116660464
Setup 2: 100 matrices of dimension 8x8 and one 500x500
ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
BLAS.set_num_threads(nthreads);
rng = Random.MersenneTwister(1);
N = 8;
A = [rand(rng, N, N) for k = 1:100]
A = pushfirst!(A, rand(rng, 500, 500))
t = @belapsed compute_ev_threads($A)
println("Number of BLAS threads: $(nthreads), Time: $(t)")
end
Number of BLAS threads: 1, Time: 0.047233419
Number of BLAS threads: 2, Time: 0.041688711
Number of BLAS threads: 4, Time: 0.041307225
and with ENV["MKL_DYNAMIC"] = true
. Again doesn’t seem to have any impact:
Number of BLAS threads: 1, Time: 0.048276108
Number of BLAS threads: 2, Time: 0.041320485
Number of BLAS threads: 4, Time: 0.040560082
Question 2: I am surprised that the results are so similar. I would have assumed that dynamic 4 threads would perform much better. First attack the 500x500 block with 4 threads and then dynamically do all the other blocks with 1 thread.
Setup 3: 12 matrices of dimension 500x500
ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
BLAS.set_num_threads(nthreads);
rng = Random.MersenneTwister(1);
N = 500;
A = [rand(rng, N, N) for k = 1:12]
t = @belapsed compute_ev_threads($A)
println("Number of BLAS threads: $(nthreads), Time: $(t)")
end
Number of BLAS threads: 1, Time: 0.110499758
Number of BLAS threads: 2, Time: 0.126500489
Number of BLAS threads: 4, Time: 0.168587943
No impact of ENV["MKL_DYNAMIC"] = true
?:
Number of BLAS threads: 1, Time: 0.110146218
Number of BLAS threads: 2, Time: 0.124481478
Number of BLAS threads: 4, Time: 0.171753869
Question 3: I guess this post is more to double-check that my assumptions are correct and that I am using the different available settings correctly. Also, it seems strange that ENV["MKL_DYNAMIC"] = true
doesn’t really have any impact. I also assumed the overall difference between the results would be bigger.