Julia Thread Affinity not persistent when calling MKL function

I observe a subtle issue: the pinning of Julia threads to specific cores is destroyed by running a seemingly harmless computation. By destroyed I mean that after the computation has run, all threads end up on the same core(!), which is, of course, horrible for the performance of everything that follows.

MWE

using Base.Threads: @threads, nthreads
using MKL # comment out -> no issue
using LinearAlgebra

# helper functions
sched_getcpu() = Int(@ccall sched_getcpu()::Cint)
function getcpuids()
    nt = nthreads()
    cpuids = zeros(Int, nt)
    @threads :static for tid in 1:nt
        cpuids[tid] = sched_getcpu()
    end
    return cpuids
end

# computation
function computation()
    @threads :static for t in 1:nthreads()
        X = rand(50, 50)
        # X = rand(5, 5) # uncomment -> no issue
        Y = inv(X) # comment out -> no issue
    end
    return nothing
end

# test loop
for i in 1:2
    println("CPUIDs (before): ", getcpuids())
    computation()
    println("CPUIDs (after): ", getcpuids(), " \n")
end

Pinning the threads in a compact manner with JULIA_EXCLUSIVE=1 (or, alternatively, ThreadPinning.jl), I obtain the following output (for 10 threads):

CPUIDs (before): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
CPUIDs (after): [3, 3, 3, 3, 3, 3, 3, 3, 3, 3] 

CPUIDs (before): [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
CPUIDs (after): [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
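
To flag this failure mode automatically rather than by eye, a small check on the vector returned by getcpuids() may help. This is only a sketch: the name pinning_collapsed is hypothetical and not part of the MWE above, and it only looks at the core IDs reported by sched_getcpu, not at the actual affinity masks.

```julia
# Hypothetical helper (not part of the original MWE): returns true when more
# than one thread was sampled and every thread reported the same core ID.
function pinning_collapsed(cpuids::AbstractVector{<:Integer})
    length(cpuids) > 1 || return false
    return all(==(first(cpuids)), cpuids)
end

# Example inputs taken from the output shown above.
before = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
after  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
println("collapsed before: ", pinning_collapsed(before))
println("collapsed after:  ", pinning_collapsed(after))
```

In the test loop one could call pinning_collapsed(getcpuids()) after computation() and print a warning when it returns true.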

Note that the issue goes away if we either

  • comment out using MKL or
  • comment out Y = inv(X), i.e. no BLAS call, or
  • uncomment the line X = rand(5,5), i.e. consider a smaller matrix X

Also note that if we re-pin the threads before each iteration (using pinthreads(:compact) from ThreadPinning.jl), we obtain:

CPUIDs (before): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
CPUIDs (after): [8, 8, 8, 8, 8, 8, 8, 8, 8, 8] 

CPUIDs (before): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
CPUIDs (after): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

So only the first call to computation seems to spoil the pinning.

My suspicion is that this is (somehow) related to MKL, perhaps some kind of initialisation that only happens on the first call? But maybe I’m wrong. Anyway, this seems like a very subtle issue that I’d like to understand better and, ideally, fix somehow!

Any ideas / suggestions would be very much appreciated!

Best,
Carsten

(@tkf, @vchuravy)


Mentioned by @vchuravy on Slack: https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-set-affinity-of-threads-spawned-by-MKL/td-p/1026152

With MKL_DYNAMIC=false and MKL_NUM_THREADS=1 (or, alternatively, BLAS.set_num_threads(1)), I get the desired behavior:

CPUIDs (before): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
CPUIDs (after): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

CPUIDs (before): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
CPUIDs (after): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]