Hi everyone,

## Problem

I am trying to understand how to optimally set the number of BLAS threads when I am using a multithreaded LAPACK function in a `Threads.@threads` loop.

More specifically I want to compute the eigendecomposition of a number of independent matrices in parallel as quickly as possible.

## System and Setup

My system:

``````Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.0.0)
CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
``````

and I compiled Julia with MKL.

The function I am calling to compute the eigenvalues and eigenvectors in this MWE is `syevr` which is multi-threaded in MKL:

``````function compute_ev_threads(A::Array{Array{Float64, 2}, 1})

W, Z = LinearAlgebra.LAPACK.syevr!('V', 'A', 'U', A[k], 0., 0., 0, 0, -1.0)
end
return nothing
end
``````

Here `A` is a list of matrices typically with a dimension in the range of `2 - 1000`. But for a specific problem `A` could hold matrices of very different sizes. The length of `A` can be up to `1000`.

## Tuning knobs

• It seems to me that MKL dynamically chooses the number of threads that it uses for the decomposition, i.e. 1 thread for smaller matrices and more threads for large matrices. That behaviour can be disabled with the environment variable `ENV["MKL_DYNAMIC"] = false`.
• I can set the maximum number of threads for BLAS with `BLAS.set_num_threads(nthreads)`. Again if `ENV["MKL_DYNAMIC"] = true` might decide to use less than that number.
• `JULIA_NUM_THREADS` can be used to tell Julia how many threads I want to use in the `Threads.@threads` loop.

## Benchmarks

### Setup 1: 50 matrices of dimension 200x200

I am setting `JULIA_NUM_THREADS = 4` (number of physical cores)

``````using LinearAlgebra, BenchmarkTools, Random

ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
rng = Random.MersenneTwister(1);
N = 200;
A = [rand(rng, N, N) for k = 1:50];
t = @belapsed compute_ev_threads(\$A)
println("Number of BLAS threads: \$(nthreads), Time: \$(t)")
end
``````
``````Number of BLAS threads: 1, Time: 0.052786899
Number of BLAS threads: 2, Time: 0.058757388
Number of BLAS threads: 4, Time: 0.115723637
``````

Here it seems like 1 BLAS thread works best. Intuitively I would say this makes sense as each Julia thread can create 1 BLAS thread and work in parallel. If each Julia thread would try to use 4 BLAS threads the CPU threads are oversubscribed and the performance goes down.

Question 1: Does this reasoning make sense? Is `BLAS.set_num_threads()` setting the global number of BLAS threads that can be used at any one time or per Julia thread? The discussion in Julia Threads vs BLAS Threads seems to say that the setting sets the global thread pool for BLAS. However, this seems to contradict above results.

Setting `ENV["MKL_DYNAMIC"] = true` doestn’t seem to have any impact. I get:

``````Number of BLAS threads: 1, Time: 0.05300575
Number of BLAS threads: 2, Time: 0.059146294
Number of BLAS threads: 4, Time: 0.116660464
``````

### Setup 2: 100 matrices of dimension 8x8 and one 500x500

``````ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
rng = Random.MersenneTwister(1);
N = 8;
A = [rand(rng, N, N) for k = 1:100]
A = pushfirst!(A, rand(rng, 500, 500))
t = @belapsed compute_ev_threads(\$A)
println("Number of BLAS threads: \$(nthreads), Time: \$(t)")
end
``````
``````Number of BLAS threads: 1, Time: 0.047233419
Number of BLAS threads: 2, Time: 0.041688711
Number of BLAS threads: 4, Time: 0.041307225
``````

and with ` ENV["MKL_DYNAMIC"] = true`. Again doesn’t seem to have any impact:

``````Number of BLAS threads: 1, Time: 0.048276108
Number of BLAS threads: 2, Time: 0.041320485
Number of BLAS threads: 4, Time: 0.040560082
``````

Question 2: I am surprised that the results are so similar. I would have assumed that dynamic 4 threads would perform much better. First attack the 500x500 block with 4 threads and then dynamically do all the other blocks with 1 thread.

### Setup 3: 12 matrices of dimension 500x500

``````ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
rng = Random.MersenneTwister(1);
N = 500;
A = [rand(rng, N, N) for k = 1:12]
t = @belapsed compute_ev_threads(\$A)
println("Number of BLAS threads: \$(nthreads), Time: \$(t)")
end

``````
``````Number of BLAS threads: 1, Time: 0.110499758
Number of BLAS threads: 2, Time: 0.126500489
Number of BLAS threads: 4, Time: 0.168587943
``````

No impact of `ENV["MKL_DYNAMIC"] = true` ?:

``````Number of BLAS threads: 1, Time: 0.110146218
Number of BLAS threads: 2, Time: 0.124481478
Number of BLAS threads: 4, Time: 0.171753869
``````

Question 3: I guess this post is more to double-check that my assumptions are correct and that I am using the different available settings correctly. Also, it seems strange that `ENV["MKL_DYNAMIC"] = true` doesn’t really have any impact. I also assumed the overall difference between the results would be bigger.

Hi @migarstka,

I was wondering if you had made some progress on this. I’m also getting started with multithreading in Julia, and trying to understand how to scale embarrassingly parallel parts of a code.

The MWE I came up with looks very similar to yours: do a few(`npb`) element-wise multiplication of vectors of size `pb` in-place, as seen below.

``````using Base.Threads
using LinearAlgebra
using BenchmarkTools

const T = Float64

# size of each job
pb = 1024

# number of jobs
npb = 128

iter = [(Vector{T}(undef, pb), rand(T, pb), rand(T, pb)) for i in 1:pb]

function foo(iter)
@threads for (x, y, z) in iter
z .= x .* y
end
end

@btime foo(\$iter)
``````

When I run this with various number of threads

``````for n in 1 2 4 6; do JULIA_NUM_THREADS=\$n julia --check-bounds=no minimal.jl; done
``````

on a 6-Core Intel Core i9 with Julia 1.8.1, I get the following timings

1 1.122 ms 6 (512 bytes)
2 888.008 μs 12 (992 bytes)
4 1.110 ms 25 (1.95 KiB)
6 974.436 μs 37 (2.91 KiB)

Not what I was expecting, given that the CPU usage looks coherent with the number of threads used. I tried a few other things (ThreadsX.foreach and FLoops.@floop), and the results are similar. Hence my question:

1. Am I missing something with Threads?
2. Should I be looking at something else (e.g. DistributedArrays)?

Any tips will be very much appreciated! Thanks,

V.

Didn’t look at the thread but note that BLAS != BLAS. OpenBLAS behaves very differently than MKL. See e.g. Julia Threads + BLAS Threads · ThreadPinning.jl