Why does setting JULIA_EXCLUSIVE=1 make MKL run single-threaded?

I am working on a cluster with 28 cores allocated to me.

I start Julia using

MKL_DYNAMIC="FALSE" JULIA_EXCLUSIVE=1 julia

although I’m not sure this is the right approach. Eventually, I want to run multiple Julia threads, each performing an independent multithreaded eigenvalue calculation (reducing the number of threads given to MKL according to the number of Julia threads).
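The intended split of cores between Julia threads and MKL threads can be sketched as follows. This is a minimal sketch, assuming the 28 allocated cores are divided evenly among the Julia threads; `blas_threads_per_julia_thread` is a hypothetical helper, not part of LinearAlgebra or MKL.jl, and whether an even split is the right policy is itself an assumption.

```julia
using LinearAlgebra

# Hypothetical helper: divide the allocated cores evenly among the Julia
# threads, keeping at least one BLAS thread per Julia thread.
blas_threads_per_julia_thread(ncores::Integer, njulia::Integer) =
    max(1, ncores ÷ njulia)

ncores = 28                    # cores allocated on the cluster (from the post)
njulia = Threads.nthreads()    # number of Julia threads in this session

# Apply the split; with a single Julia thread this requests all 28 cores.
BLAS.set_num_threads(blas_threads_per_julia_thread(ncores, njulia))
```

For example, starting Julia with 4 threads would give each eigenvalue calculation 7 BLAS threads under this scheme.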

I run

julia> using LinearAlgebra, MKL, BenchmarkTools

julia> BLAS.get_num_threads()
1

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  950.672 ms (19 allocations: 25.90 MiB)

julia> BLAS.set_num_threads(2)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  1.780 s (19 allocations: 25.90 MiB)

julia> BLAS.set_num_threads(28)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  22.887 s (19 allocations: 25.90 MiB)

In this case, increasing the number of BLAS threads makes the calculation slower, not faster. htop shows that the eigenvalue calculation is using only one CPU thread, irrespective of what I set.
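The manual steps above can be collected into one loop so that the same sweep is easy to repeat under each launch configuration. This is a rough sketch: `sweep_blas_threads` is a hypothetical helper, and it uses a single `@elapsed` measurement per thread count instead of `@btime`, so the timings are noisier than the ones quoted in this post.

```julia
using LinearAlgebra

# Hypothetical helper: time eigen! on a fresh n×n random matrix for each
# requested BLAS thread count and return a Dict of count => seconds.
function sweep_blas_threads(counts; n = 1000)
    timings = Dict{Int,Float64}()
    eigen!(rand(8, 8))                 # warm-up so compilation is not timed
    for nt in counts
        BLAS.set_num_threads(nt)
        A = rand(n, n)                 # fresh input, since eigen! overwrites A
        timings[nt] = @elapsed eigen!(A)
    end
    return timings
end

timings = sweep_blas_threads((1, 2, 28))
for nt in sort(collect(keys(timings)))
    println(nt, " BLAS threads: ", round(timings[nt]; digits = 3), " s")
end
```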

Similarly, if I launch Julia with

MKL_NUM_THREADS=28 JULIA_EXCLUSIVE=1 julia

I obtain

julia> using LinearAlgebra, MKL, BenchmarkTools

julia> BLAS.get_num_threads()
28

julia> BLAS.set_num_threads(1)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  954.241 ms (19 allocations: 25.90 MiB)

julia> BLAS.set_num_threads(2)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  1.795 s (19 allocations: 25.90 MiB)

julia> BLAS.set_num_threads(28)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  23.434 s (19 allocations: 25.90 MiB)

which is again single-threaded. The issue seems to be with JULIA_EXCLUSIVE and not with the MKL variables.
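One way to probe whether CPU affinity is involved is to look at the affinity mask of the Julia process under each launch configuration. The following is a Linux-only sketch that reads `/proc/self/status`; `allowed_cpus` is a hypothetical helper. A mask containing a single CPU would be consistent with the htop observation, on the assumption that threads created by MKL inherit the mask of the thread that spawns them. That is a hypothesis to test, not a confirmed diagnosis.

```julia
# Hypothetical helper (Linux only): return the list of CPUs the kernel allows
# for the main thread of this process, e.g. "0" or "0-27", or `nothing` if
# the information is not available.
function allowed_cpus(status_path = "/proc/self/status")
    isfile(status_path) || return nothing
    for line in eachline(status_path)
        if startswith(line, "Cpus_allowed_list:")
            # Keep everything after the first colon and trim the whitespace.
            return strip(split(line, ':'; limit = 2)[2])
        end
    end
    return nothing
end

println("Cpus_allowed_list: ", allowed_cpus())
```

Comparing the printed list between a session started with JULIA_EXCLUSIVE=1 and one started without it would show whether the two differ in affinity.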

In contrast, without setting JULIA_EXCLUSIVE, and launching Julia as

MKL_DYNAMIC="FALSE" julia

I obtain

julia> using LinearAlgebra, MKL, BenchmarkTools

julia> BLAS.get_num_threads()
28

julia> BLAS.set_num_threads(1)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  951.961 ms (19 allocations: 25.90 MiB)

julia> BLAS.set_num_threads(2)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  970.861 ms (19 allocations: 25.90 MiB)

julia> BLAS.set_num_threads(28)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  702.182 ms (19 allocations: 25.90 MiB)

This certainly seems to be working as expected.

Interestingly, I don’t see this behavior when I use ThreadPinning.jl. Starting Julia as

MKL_DYNAMIC="FALSE" MKL_NUM_THREADS=28 julia

I obtain

julia> using LinearAlgebra, MKL, BenchmarkTools, ThreadPinning

julia> BLAS.get_num_threads()
28

julia> BLAS.set_num_threads(1)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  951.875 ms (19 allocations: 25.90 MiB)

julia> BLAS.set_num_threads(2)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  985.169 ms (19 allocations: 25.90 MiB)

julia> BLAS.set_num_threads(28)

julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  684.700 ms (19 allocations: 25.90 MiB)

julia> threadinfo(; blas = true)

| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
  16,17,18,19 |
| 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,
  36,37,38,39 |

# = Julia thread, | = Socket seperator

Julia threads: 1
├ Occupied CPU-threads: 1
└ Mapping (Thread => CPUID): 1 => 4,

BLAS: libmkl_rt.so
├ mkl_get_num_threads: 28
└ mkl_get_dynamic: false


julia> pinthreads(:compact)

julia> threadinfo(; blas = true)

| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
  16,17,18,19 |
| 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,
  36,37,38,39 |

# = Julia thread, | = Socket seperator

Julia threads: 1
├ Occupied CPU-threads: 1
└ Mapping (Thread => CPUID): 1 => 0,

BLAS: libmkl_rt.so
├ mkl_get_num_threads: 28
└ mkl_get_dynamic: false


julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
  684.968 ms (19 allocations: 25.90 MiB)

I’m not sure whether setting JULIA_EXCLUSIVE=1 is equivalent to calling pinthreads(:compact), although the output of threadinfo(; blas = true) appears to be the same in both cases.

For comparison, I launch using

MKL_DYNAMIC="FALSE" JULIA_EXCLUSIVE=1 MKL_NUM_THREADS=28 julia

and run

julia> using LinearAlgebra, MKL, BenchmarkTools, ThreadPinning

julia> threadinfo(; blas = true)

| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
  16,17,18,19 |
| 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,
  36,37,38,39 |

# = Julia thread, | = Socket seperator

Julia threads: 1
├ Occupied CPU-threads: 1
└ Mapping (Thread => CPUID): 1 => 0,

BLAS: libmkl_rt.so
├ mkl_get_num_threads: 28
└ mkl_get_dynamic: false

This is using MKL.jl v0.5.0, and I see the same behavior on Julia 1.7 and nightly.

So… is JULIA_EXCLUSIVE not something to be used here?