Thread affinitization: pinning Julia threads to cores

Hi all,
as part of my attempt to create a simple port of the STREAM benchmark (STREAMBenchmark.jl) for estimating the memory bandwidth of a machine, I asked myself whether it is possible to pin Julia threads to specific cores. I'm mostly focused on Linux, but macOS would be great as well (though I doubt it is possible on the latter).

(I know that I can pin tasks to threads using, for example, @tspawnat 4 Threads.threadid() from ThreadPools.jl. Here I am concerned with pinning threads to cores.)

For C / OpenMP I can use something like likwid-pin to, say, pin threads to cores 0,2,4,6:

likwid-pin -c 0,2,4,6  ./myApp
  • Can I do something similar to pin Julia threads?
  • Is there a way in Julia to figure out which core a thread is running on? (instead of watching htop)

On Slack, @Elrod has pointed me to JULIA_EXCLUSIVE, which seems to allow me to change the affinitization behaviour in some way. However, it doesn't seem to be as fine-grained, and I'm not yet sure what it does (I will try to find out and report back; I guess that it pins threads to cores 1:nthreads()). Also, according to juliahub.com, there seems to be only a single Julia package in the ecosystem which uses it: Circo/circonode.sh at v0.1.1 · Circo-dev/Circo · GitHub.

Any help in answering the questions above is very much appreciated!

2 Likes

JULIA_EXCLUSIVE is an environment variable.
Just start Julia:

> JULIA_EXCLUSIVE=1 julia -t8

So packages aren’t really the correct place to “use” it, just like packages aren’t setting -t8 or defining JULIA_NUM_THREADS for you.

2 Likes

Wouldn’t it be enough to get the thread/cpu affinity (with e.g. sched_getcpu() under Linux) and then execute the tasks on the appropriate threads?

1 Like

And that one only uses it currently to avoid a Julia crash in (I guess) error logging when a stacktrace is attached, which I have not yet been able to narrow down to a reportable MWE…

Alright, I’ve done some testing and had some discussion on Slack/GitHub. Let me share my findings.

1) Query the core id of a thread.

Let’s start with the second question of the OP first:

Is there a way in Julia to figure out which core a thread is running on?

Thanks @pbayer for the pointer to sched_getcpu(). We can call it in Julia like so:

glibc_coreid() = @ccall sched_getcpu()::Cint

and query the core id of a specific thread using ThreadPools’ @tspawnat:

using ThreadPools
tglibc_coreid(i::Integer) = fetch(@tspawnat i glibc_coreid());

Running the following script on a cluster node,

using ThreadPools
using Base.Threads: nthreads

glibc_coreid() = @ccall sched_getcpu()::Cint
tglibc_coreid(i::Integer) = fetch(@tspawnat i glibc_coreid());

for i in 1:nthreads()
    println("Running on thread $i (glibc_coreid: $(tglibc_coreid(i)))")
end

I get

$ julia -t10 threads_cpuids_glibc.jl
Running on thread 1 (glibc_coreid: 0)
Running on thread 2 (glibc_coreid: 4)
Running on thread 3 (glibc_coreid: 3)
Running on thread 4 (glibc_coreid: 6)
Running on thread 5 (glibc_coreid: 5)
Running on thread 6 (glibc_coreid: 8)
Running on thread 7 (glibc_coreid: 7)
Running on thread 8 (glibc_coreid: 10)
Running on thread 9 (glibc_coreid: 9)
Running on thread 10 (glibc_coreid: 12)

I confirmed with random computations and htop that these core ids are actually correct. Great!

What about macOS (and Windows)?

Note that while sched_getcpu() is available on Linux, it isn't on macOS (and presumably not on Windows either). Looking for a counterpart, I found this SO thread, which mentions that it should be possible using the cpuid machine instruction, which is wrapped in CpuId.jl. We are currently trying to make it work, see CpuId-based sched_getcpu pendant for macOS · Issue #46 · m-j-w/CpuId.jl · GitHub.
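For the curious, the rough idea is to read a hardware thread identifier directly from the CPU. Below is a minimal sketch (x86 only, untested on macOS; the name apicid is mine, not an established API): cpuid leaf 0x0B returns the x2APIC ID of the current logical processor in EDX, which would then still have to be mapped to an OS-level core id:

# Read the x2APIC ID of the current logical processor (cpuid leaf 0x0B, EDX).
apicid() = Base.llvmcall(
    """
    %res = call { i32, i32, i32, i32 } asm sideeffect "cpuid", "={ax},={bx},={cx},={dx},{ax},{cx}"(i32 11, i32 0)
    %edx = extractvalue { i32, i32, i32, i32 } %res, 3
    ret i32 %edx
    """, UInt32, Tuple{})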

2) Pinning threads to specific cores

Using the script from above (and htop as a cross-check) I can confirm that JULIA_EXCLUSIVE=1 forces Julia to put the threads on the first nthreads() cores, i.e. glibc core ids 0:nthreads()-1:

$ JULIA_EXCLUSIVE=1 julia -t10 threads_cpuids_glibc.jl
Running on thread 1 (glibc_coreid: 0)
Running on thread 2 (glibc_coreid: 1)
Running on thread 3 (glibc_coreid: 2)
Running on thread 4 (glibc_coreid: 3)
Running on thread 5 (glibc_coreid: 4)
Running on thread 6 (glibc_coreid: 5)
Running on thread 7 (glibc_coreid: 6)
Running on thread 8 (glibc_coreid: 7)
Running on thread 9 (glibc_coreid: 8)
Running on thread 10 (glibc_coreid: 9)

But what about choosing other cores? I tried using numactl --physcpubind first:

$ numactl --physcpubind=3,5,7,12 julia -t4 threads_cpuids_glibc.jl
Running on thread 1 (glibc_coreid: 3)
Running on thread 2 (glibc_coreid: 12)
Running on thread 3 (glibc_coreid: 5)
Running on thread 4 (glibc_coreid: 12)

Note that the threads indeed run on cores from the given list. However, two threads happen to run on the same core. Trying this multiple times, I see no clear pattern: the thread → cpuid mapping varies, and so does which core (if any) hosts more than one thread. So my takeaway is that numactl only allows us to restrict the Julia threads to a specific domain of cores.
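As a cross-check of this domain restriction, one can query the affinity mask that numactl installs from within Julia. A minimal sketch via glibc's sched_getaffinity (Linux only; the helper name allowed_cpus is mine):

function allowed_cpus()
    # cpu_set_t is 1024 bits (128 bytes) on Linux; pid 0 means "calling thread"
    cpuset = zeros(UInt8, 128)
    ret = @ccall sched_getaffinity(0::Cint, length(cpuset)::Csize_t, cpuset::Ptr{Cvoid})::Cint
    ret == 0 || error("sched_getaffinity failed")
    # cpu id c is stored in bit c % 8 of byte c ÷ 8
    [8 * (i - 1) + j for i in eachindex(cpuset) for j in 0:7 if (cpuset[i] >> j) & 1 == 1]
end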

I also tried likwid-pin -c. Strangely, I had to specify one more core id than there are Julia threads to prevent the "Roundrobin placement triggered" message (which almost always indicates that something is wrong). I found:

$ likwid-pin -c 0,9,14,32,76 julia -t4 threads_cpuids_glibc.jl
[pthread wrapper]
[pthread wrapper] MAIN -> 0
[pthread wrapper] PIN_MASK: 0->9  1->14  2->32  3->76
[pthread wrapper] SKIP MASK: 0x0
	threadid 22624924616448 -> hwthread 9 - OK
	threadid 22624673154816 -> hwthread 14 - OK
	threadid 22624657823488 -> hwthread 32 - OK
	threadid 22624642492160 -> hwthread 76 - OK
Running on thread 1 (glibc_coreid: 0)
Running on thread 2 (glibc_coreid: 14)
Running on thread 3 (glibc_coreid: 32)
Running on thread 4 (glibc_coreid: 76)

That’s almost what we want! However, it’s odd that we have to provide one more cpu id and that the second id isn’t used. Trying one more time:

$ likwid-pin -c 0,40,41,42,43,44,45,46,47,48,49,50 julia -t10 threads_cpuids_glibc.jl
[pthread wrapper]
[pthread wrapper] MAIN -> 0
[pthread wrapper] PIN_MASK: 0->40  1->41  2->42  3->43  4->44  5->45  6->46  7->47  8->48  9->49  10->50
[pthread wrapper] SKIP MASK: 0x0
	threadid 23027507218176 -> hwthread 40 - OK
	threadid 23027249415936 -> hwthread 41 - OK
	threadid 23027234084608 -> hwthread 42 - OK
	threadid 23027218753280 -> hwthread 43 - OK
	threadid 23027203421952 -> hwthread 44 - OK
	threadid 23026984806144 -> hwthread 45 - OK
	threadid 23026970117888 -> hwthread 46 - OK
	threadid 23026955429632 -> hwthread 47 - OK
	threadid 23026940741376 -> hwthread 48 - OK
	threadid 23026933802752 -> hwthread 49 - OK
Running on thread 1 (glibc_coreid: 0)
Running on thread 2 (glibc_coreid: 41)
Running on thread 3 (glibc_coreid: 42)
Running on thread 4 (glibc_coreid: 43)
Running on thread 5 (glibc_coreid: 44)
Running on thread 6 (glibc_coreid: 45)
Running on thread 7 (glibc_coreid: 46)
Running on thread 8 (glibc_coreid: 47)
Running on thread 9 (glibc_coreid: 48)
Running on thread 10 (glibc_coreid: 49)

Seems to be consistent, but probably needs a bit more testing across different architectures. (I had tested this yesterday as well and thought that I had multiple threads on the same core there, too… but maybe I'm misremembering.)

Note that JULIA_EXCLUSIVE=1 overrides both numactl and likwid-pin and puts Julia's threads on the first nthreads() cores irrespective of the provided cpu id list:

$ JULIA_EXCLUSIVE=1 numactl --physcpubind=9,14,32,76 julia -t4 threads_cpuids_glibc.jl
Running on thread 1 (glibc_coreid: 0)
Running on thread 2 (glibc_coreid: 1)
Running on thread 3 (glibc_coreid: 2)
Running on thread 4 (glibc_coreid: 3)

$ JULIA_EXCLUSIVE=1 likwid-pin -c 9,14,32,76,77 julia -t4 threads_cpuids_glibc.jl
[pthread wrapper]
[pthread wrapper] MAIN -> 9
[pthread wrapper] PIN_MASK: 0->14  1->32  2->76  3->77
[pthread wrapper] SKIP MASK: 0x0
	threadid 22926639785728 -> hwthread 14 - OK
	threadid 22926388324096 -> hwthread 32 - OK
	threadid 22926372992768 -> hwthread 76 - OK
	threadid 22926357661440 -> hwthread 77 - OK
Running on thread 1 (glibc_coreid: 0)
Running on thread 2 (glibc_coreid: 1)
Running on thread 3 (glibc_coreid: 2)
Running on thread 4 (glibc_coreid: 3)

What about macOS (and Windows)?

Probably no chance? Both numactl and likwid-pin are only available on Linux (please correct me if I'm wrong / if there are alternatives or workarounds).

(cc @Elrod, @vchuravy)

6 Likes

Since JULIA_EXCLUSIVE somehow manages to force thread pinning to the first nthreads() cores, maybe there is a way to use the same technique (I have no clue how it works under the hood) to allow for something like JULIA_EXCLUSIVE=1,4,7,32 where 1,4,7,32 are core ids?

1 Like

To answer your question, here is what I did on my Linux machine:

gdb --args julia -e "exit()"
(gdb) b pthread_create
Breakpoint 1, 0x00007ffff77fb560 in pthread_create@@GLIBC_2.2.5 ()
   from /usr/bin/../lib/libpthread.so.0

and then looked at the backtraces of all the location were we are creating threads during startup.

  • 1x pthread_create in src/signals-unix.c
  • 7x pthread_create in blas_thread_init () from /usr/bin/../lib/julia/libopenblas64_.so

Starting Julia with -t 4 adds another three calls in jl_start_threads.

Setting OPENBLAS_NUM_THREADS=1 as an environment variable removes the additional threads created by OpenBLAS, leaving you with the main thread, the signal thread, and (with -t N) the N-1 worker threads.

I suspect that thread pinning tools like numactl treat all of these threads equally, which is why you see two Julia threads being pinned to the same core: with OpenBLAS in the mix, the cores get oversubscribed. There was some initial thinking about having OpenBLAS use the Julia thread pool (partr thread support for openblas · Issue #32786 · JuliaLang/julia · GitHub), but that hasn't been implemented yet.
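For reference, the OpenBLAS thread count can also be changed at runtime from within Julia; note, though, that only the environment variable prevents the extra threads from being created at startup in the first place:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)  # runtime counterpart of OPENBLAS_NUM_THREADS=1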

2 Likes

Using LIKWID you can provide a skip mask (a bit mask in which a set bit i means: do not pin the i-th created pthread):

# Build a skip mask that skips everything except pthreads 1:N
# (for -tN, those are Julia's non-main threads).
function mask(N)
    mask = UInt(0)
    for i in 1:N
        mask |= 1 << i
    end
    ~mask # invert the mask so that only the Julia threads are pinned
end
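For four Julia threads (-t4) this reproduces the skip mask passed via -s below:

julia> string(mask(4), base = 16)
"ffffffffffffffe1"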
vchuravy@odin ~> likwid-pin -s 0xffffffffffffffe1 -c 1,3,5,7 julia -t4
[pthread wrapper] 
[pthread wrapper] MAIN -> 1
[pthread wrapper] PIN_MASK: 0->3  1->5  2->7  
[pthread wrapper] SKIP MASK: 0xFFFFFFFFFFFFFFE1
	threadid 140618166851136 -> SKIP 
	threadid 140617906632256 -> hwthread 3 - OK
	threadid 140617884972608 -> hwthread 5 - OK
	threadid 140617863349824 -> hwthread 7 - OK

julia> using Base.Threads

julia> glibc_coreid() = @ccall sched_getcpu()::Cint
glibc_coreid (generic function with 1 method)

julia> @threads for i in 1:nthreads()
         @show (i, glibc_coreid())
       end
(i, glibc_coreid()) = (1, 1)
(i, glibc_coreid()) = (2, 3)
(i, glibc_coreid()) = (3, 5)
(i, glibc_coreid()) = (4, 7)
3 Likes

Just in case someone finds this thread: complementing the options above, I've created ThreadPinning.jl, which provides a function pinthreads that you can use to dynamically pin threads to specific cores. (Only works on Linux.)
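For the curious: on Linux this essentially boils down to a sched_setaffinity call executed on each Julia thread. A simplified sketch of the idea (not ThreadPinning.jl's actual code; the names pin_current_thread and pinthread are made up here):

using ThreadPools # for @tspawnat, to run the ccall on a specific thread

function pin_current_thread(cpuid::Integer)
    # cpu_set_t is 1024 bits (128 bytes) on Linux; set only the bit for `cpuid`
    cpuset = zeros(UInt8, 128)
    cpuset[div(cpuid, 8) + 1] = 1 << mod(cpuid, 8)
    # pid 0 refers to the calling thread
    ret = @ccall sched_setaffinity(0::Cint, length(cpuset)::Csize_t, cpuset::Ptr{Cvoid})::Cint
    ret == 0 || error("sched_setaffinity failed")
    return nothing
end

# pin Julia thread `tid` to core `cpuid`
pinthread(tid::Integer, cpuid::Integer) = fetch(@tspawnat tid pin_current_thread(cpuid))

for (tid, c) in zip(1:4, (3, 5, 7, 12)) # e.g. pin threads 1:4 to cores 3,5,7,12
    pinthread(tid, c)
end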

3 Likes

I’ve been experimenting with ThreadPinning.jl in the context of this post: Current OpenBLAS Versions (January 2022) do not support Intel gen 11 performantly? - #30 by fgerick.

I have some trouble understanding this behaviour:

#julia -t 48 
julia> using BenchmarkTools, Hwloc, ThreadPinning, LinearAlgebra, Octavian; A = rand(5_000,5_000); B = similar(A);

julia> BLAS.get_num_threads()
96

julia> BLAS.set_num_threads(48)

julia> @btime mul!($B, $A, $A);
  626.977 ms (0 allocations: 0 bytes)

julia> BLAS.set_num_threads(24)

julia> @btime mul!($B, $A, $A);
  377.458 ms (0 allocations: 0 bytes)

julia> @btime matmul!($B, $A, $A);
  521.579 ms (0 allocations: 0 bytes)

julia> pinthreads(:compact)

julia> @btime matmul!($B, $A, $A);
  223.912 ms (0 allocations: 0 bytes)

julia> @btime matmul!($B, $A, $A);
  233.511 ms (0 allocations: 0 bytes)

julia> pinthreads(:compact)

julia> @btime matmul!($B, $A, $A);
  188.991 ms (0 allocations: 0 bytes)

julia> @btime matmul!($B, $A, $A);
  186.621 ms (0 allocations: 0 bytes)

julia> @btime mul!($B, $A, $A);
  380.789 ms (0 allocations: 0 bytes)

The OpenMP (BLAS) threads are not affected by ThreadPinning.jl, as I understand it. However, why do I see different timings for the pure Julia code in Octavian’s matmul! after the first pinthreads(:compact) and after the second one? Is there a way to start Julia on just one socket, without having to pin afterwards?

I don’t have a quick answer, but you could use threadinfo() before and after the benchmark to see whether the pinning of the Julia threads has changed during the benchmark (for some reason).
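For example (threadinfo() prints the current thread → core mapping):

julia> using ThreadPinning

julia> threadinfo() # mapping before the benchmark

julia> @btime matmul!($B, $A, $A);

julia> threadinfo() # has anything moved?

As for starting Julia on just one socket: assuming the first socket corresponds to NUMA node 0, something like numactl --cpunodebind=0 julia -t 24 should restrict Julia to that socket’s cores from the start.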