Hi everyone,

## Problem

I am trying to understand how to optimally set the number of BLAS threads when I am using a multithreaded **LAPACK** function in a `Threads.@threads`

loop.

More specifically I want to compute the eigendecomposition of a number of independent matrices in parallel as quickly as possible.

## System and Setup

My system:

```
Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.0.0)
CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 4
```

and I compiled **Julia with MKL**.

The function I am calling to compute the eigenvalues and eigenvectors in this MWE is `syevr`

which is multi-threaded in MKL:

```
function compute_ev_threads(A::Array{Array{Float64, 2}, 1})
Threads.@threads for k = 1:length(A)
W, Z = LinearAlgebra.LAPACK.syevr!('V', 'A', 'U', A[k], 0., 0., 0, 0, -1.0)
end
return nothing
end
```

Here `A`

is a list of matrices typically with a dimension in the range of `2 - 1000`

. But for a specific problem `A`

could hold matrices of very different sizes. The length of `A`

can be up to `1000`

.

## Tuning knobs

- It seems to me that MKL dynamically chooses the number of threads that it uses for the decomposition, i.e. 1 thread for smaller matrices and more threads for large matrices. That behaviour can be disabled with the environment variable
`ENV["MKL_DYNAMIC"] = false`

. - I can set the maximum number of threads for BLAS with
`BLAS.set_num_threads(nthreads)`

. Again if`ENV["MKL_DYNAMIC"] = true`

might decide to use less than that number. -
`JULIA_NUM_THREADS`

can be used to tell Julia how many threads I want to use in the`Threads.@threads`

loop.

## Benchmarks

### Setup 1: 50 matrices of dimension 200x200

I am setting `JULIA_NUM_THREADS = 4`

(number of physical cores)

```
using LinearAlgebra, BenchmarkTools, Random
ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
BLAS.set_num_threads(nthreads);
rng = Random.MersenneTwister(1);
N = 200;
A = [rand(rng, N, N) for k = 1:50];
t = @belapsed compute_ev_threads($A)
println("Number of BLAS threads: $(nthreads), Time: $(t)")
end
```

```
Number of BLAS threads: 1, Time: 0.052786899
Number of BLAS threads: 2, Time: 0.058757388
Number of BLAS threads: 4, Time: 0.115723637
```

Here it seems like 1 **BLAS thread** works best. Intuitively I would say this makes sense as each Julia thread can create 1 BLAS thread and work in parallel. If each Julia thread would try to use 4 BLAS threads the CPU threads are oversubscribed and the performance goes down.

**Question 1:** Does this reasoning make sense? Is `BLAS.set_num_threads()`

setting the global number of BLAS threads that can be used at any one time or per Julia thread? The discussion in Julia Threads vs BLAS Threads seems to say that the setting sets the global thread pool for BLAS. However, this seems to contradict above results.

Setting `ENV["MKL_DYNAMIC"] = true`

doestn’t seem to have any impact. I get:

```
Number of BLAS threads: 1, Time: 0.05300575
Number of BLAS threads: 2, Time: 0.059146294
Number of BLAS threads: 4, Time: 0.116660464
```

### Setup 2: 100 matrices of dimension 8x8 and one 500x500

```
ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
BLAS.set_num_threads(nthreads);
rng = Random.MersenneTwister(1);
N = 8;
A = [rand(rng, N, N) for k = 1:100]
A = pushfirst!(A, rand(rng, 500, 500))
t = @belapsed compute_ev_threads($A)
println("Number of BLAS threads: $(nthreads), Time: $(t)")
end
```

```
Number of BLAS threads: 1, Time: 0.047233419
Number of BLAS threads: 2, Time: 0.041688711
Number of BLAS threads: 4, Time: 0.041307225
```

and with ` ENV["MKL_DYNAMIC"] = true`

. Again doesn’t seem to have any impact:

```
Number of BLAS threads: 1, Time: 0.048276108
Number of BLAS threads: 2, Time: 0.041320485
Number of BLAS threads: 4, Time: 0.040560082
```

**Question 2**: I am surprised that the results are so similar. I would have assumed that dynamic 4 threads would perform much better. First attack the 500x500 block with 4 threads and then dynamically do all the other blocks with 1 thread.

### Setup 3: 12 matrices of dimension 500x500

```
ENV["MKL_DYNAMIC"] = false
for nthreads in [1; 2; 4]
BLAS.set_num_threads(nthreads);
rng = Random.MersenneTwister(1);
N = 500;
A = [rand(rng, N, N) for k = 1:12]
t = @belapsed compute_ev_threads($A)
println("Number of BLAS threads: $(nthreads), Time: $(t)")
end
```

```
Number of BLAS threads: 1, Time: 0.110499758
Number of BLAS threads: 2, Time: 0.126500489
Number of BLAS threads: 4, Time: 0.168587943
```

No impact of `ENV["MKL_DYNAMIC"] = true`

?:

```
Number of BLAS threads: 1, Time: 0.110146218
Number of BLAS threads: 2, Time: 0.124481478
Number of BLAS threads: 4, Time: 0.171753869
```

**Question 3**: I guess this post is more to double-check that my assumptions are correct and that I am using the different available settings correctly. Also, it seems strange that `ENV["MKL_DYNAMIC"] = true`

doesn’t really have any impact. I also assumed the overall difference between the results would be bigger.