I am working on a cluster with 28 cores allocated to me.
I start Julia using
MKL_DYNAMIC="FALSE" JULIA_EXCLUSIVE=1 julia
although I’m uncertain if this is the right approach. Eventually, I aim to run multiple julia threads, each of which performs independent multithreaded eigenvalue calculations (where I’ll reduce the number of threads allocated to MKL based on the number of Julia threads).
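To make the end goal concrete, here is a minimal sketch of the intended setup, under stated assumptions: the 28 cores are split evenly across the Julia threads, the matrix sizes and the count of four problems are purely illustrative, and the names `ncores`, `nblas`, `matrices`, and `spectra` are hypothetical. How a global BLAS thread count interacts with concurrent calls depends on the BLAS backend, so this shows the plan rather than a verified configuration.

```julia
using LinearAlgebra

# Hypothetical split: divide the allocated cores among the Julia threads.
ncores = 28                                   # cores allocated on the cluster
nblas = max(1, ncores ÷ Threads.nthreads())   # BLAS threads per Julia thread
BLAS.set_num_threads(nblas)

# One independent eigenvalue problem per task (sizes are illustrative only).
matrices = [rand(100, 100) for _ in 1:4]
tasks = map(matrices) do M
    Threads.@spawn eigvals(M)
end
spectra = fetch.(tasks)   # one vector of eigenvalues per matrix
```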
I run
julia> using LinearAlgebra, MKL, BenchmarkTools
julia> BLAS.get_num_threads()
1
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
950.672 ms (19 allocations: 25.90 MiB)
julia> BLAS.set_num_threads(2)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
1.780 s (19 allocations: 25.90 MiB)
julia> BLAS.set_num_threads(28)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
22.887 s (19 allocations: 25.90 MiB)
Evidently, increasing the number of threads makes things worse in this case. Using htop, I see that the eigenvalue calculation only ever uses one thread, irrespective of what I set.
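The manual sweep over thread counts above could also be scripted. The sketch below uses `@elapsed` instead of `@btime` so that it only needs the standard library; `time_eigen` is a hypothetical helper, and a single `@elapsed` measurement is noisier than a BenchmarkTools run, so the numbers are only indicative.

```julia
using LinearAlgebra

# Hypothetical helper: time a single eigen! call with a given BLAS thread count.
function time_eigen(nthreads::Integer; n::Integer = 1000)
    BLAS.set_num_threads(nthreads)
    A = rand(n, n)            # fresh matrix, since eigen! overwrites its input
    return @elapsed eigen!(A)
end

time_eigen(1; n = 10)         # warm-up call so compilation is not timed

for nt in (1, 2, 28)
    println(nt, " BLAS threads: ", round(time_eigen(nt); digits = 3), " s")
end
```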
Similarly, if I launch julia by setting
MKL_NUM_THREADS=28 JULIA_EXCLUSIVE=1 julia
I obtain
julia> using LinearAlgebra, MKL, BenchmarkTools
julia> BLAS.get_num_threads()
28
julia> BLAS.set_num_threads(1)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
954.241 ms (19 allocations: 25.90 MiB)
julia> BLAS.set_num_threads(2)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
1.795 s (19 allocations: 25.90 MiB)
julia> BLAS.set_num_threads(28)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
23.434 s (19 allocations: 25.90 MiB)
which is again single-threaded. The issue seems to be with JULIA_EXCLUSIVE and not with the MKL variables.
In contrast, without setting JULIA_EXCLUSIVE, and launching julia as
MKL_DYNAMIC="FALSE" julia
I obtain
julia> using LinearAlgebra, MKL, BenchmarkTools
julia> BLAS.get_num_threads()
28
julia> BLAS.set_num_threads(1)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
951.961 ms (19 allocations: 25.90 MiB)
julia> BLAS.set_num_threads(2)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
970.861 ms (19 allocations: 25.90 MiB)
julia> BLAS.set_num_threads(28)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
702.182 ms (19 allocations: 25.90 MiB)
This certainly seems to be working as expected.
Interestingly, I don’t see this behavior if I use ThreadPinning.jl, in which case I find, starting julia as
MKL_DYNAMIC="FALSE" MKL_NUM_THREADS=28 julia
julia> using LinearAlgebra, MKL, BenchmarkTools, ThreadPinning
julia> BLAS.get_num_threads()
28
julia> BLAS.set_num_threads(1)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
951.875 ms (19 allocations: 25.90 MiB)
julia> BLAS.set_num_threads(2)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
985.169 ms (19 allocations: 25.90 MiB)
julia> BLAS.set_num_threads(28)
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
684.700 ms (19 allocations: 25.90 MiB)
julia> threadinfo(; blas = true)
| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
16,17,18,19 |
| 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,
36,37,38,39 |
# = Julia thread, | = Socket seperator
Julia threads: 1
├ Occupied CPU-threads: 1
└ Mapping (Thread => CPUID): 1 => 4,
BLAS: libmkl_rt.so
├ mkl_get_num_threads: 28
└ mkl_get_dynamic: false
julia> pinthreads(:compact)
julia> threadinfo(; blas = true)
| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
16,17,18,19 |
| 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,
36,37,38,39 |
# = Julia thread, | = Socket seperator
Julia threads: 1
├ Occupied CPU-threads: 1
└ Mapping (Thread => CPUID): 1 => 0,
BLAS: libmkl_rt.so
├ mkl_get_num_threads: 28
└ mkl_get_dynamic: false
julia> @btime eigen!(A) setup=(A = rand(1000, 1000));
684.968 ms (19 allocations: 25.90 MiB)
I’m uncertain if setting JULIA_EXCLUSIVE=1 is equivalent to pinthreads, although the output of threadinfo(; blas = true) seems to be the same.
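One way to probe whether the two approaches really leave the process in the same state, assuming a Linux system, is to read the CPU affinity list of the process' main thread from /proc/self/status. The helper name `allowed_cpus` is hypothetical. If the list collapses to a single core under JULIA_EXCLUSIVE=1 but not after pinthreads, threads created later by MKL might inherit that restriction, which would be consistent with the htop observation; that is a guess to be checked, not a confirmed diagnosis.

```julia
# Hypothetical helper: return the "Cpus_allowed_list" entry for the main thread
# of the current process, or nothing if it is unavailable (e.g. not on Linux).
function allowed_cpus(path::AbstractString = "/proc/self/status")
    isfile(path) || return nothing
    for line in eachline(path)
        if startswith(line, "Cpus_allowed_list:")
            # Keep only the text after the first colon, without padding.
            return String(strip(last(split(line, ':'; limit = 2))))
        end
    end
    return nothing
end

println("Allowed CPUs: ", allowed_cpus())
```

Running this once in a session started with JULIA_EXCLUSIVE=1 and once after pinthreads(:compact) would show whether the affinity lists differ.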
Launch using:
MKL_DYNAMIC="FALSE" JULIA_EXCLUSIVE=1 MKL_NUM_THREADS=28 julia
and run
julia> using LinearAlgebra, MKL, BenchmarkTools, ThreadPinning
julia> threadinfo(; blas = true)
| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
16,17,18,19 |
| 20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,
36,37,38,39 |
# = Julia thread, | = Socket seperator
Julia threads: 1
├ Occupied CPU-threads: 1
└ Mapping (Thread => CPUID): 1 => 0,
BLAS: libmkl_rt.so
├ mkl_get_num_threads: 28
└ mkl_get_dynamic: false
This is using MKL v0.5.0, and I see the same behavior on Julia 1.7 and nightly.
So… is JULIA_EXCLUSIVE not something to be used here?