Julia + SLURM + BLAS + multithreading: threads not mapping to cores well, leading to poor performance

I am benchmarking code that uses:

- Julia multithreading
- multithreaded BLAS
- single-threaded BLAS called in parallel from Julia threads

on NERSC Perlmutter (2x AMD EPYC 7763, 64 cores x 2 hyperthreads per core each), with Cray MPICH.

From my testing, I am seeing the following. When I request a job:

salloc --nodes 1 --qos interactive --time 04:00:00 --constraint "cpu" --account=$ACCT_NUM  --hint=nomultithread --ntasks-per-node=2 --cpus-per-task=64  --exclusive

When I use ThreadPinning.jl to check which CPUs the Julia threads are running on:

using ThreadPinning
println(ThreadPinning.getcpuids())

The CPU IDs returned are a mix of physical cores (0-63) and hyperthreads (64-127).
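For reference, here is roughly how I inspect the placement (a sketch assuming ThreadPinning.jl ≥ 1.0, where `threadinfo()` prints a per-core map and marks hyperthreads distinctly):

```julia
using ThreadPinning

# Visual map of cores/hyperthreads and where the Julia threads sit.
threadinfo()

# Programmatic check: which CPU IDs are the Julia threads on?
cpuids = getcpuids()
println(cpuids)
```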

When I use, for example


if rank == 0
    pinthreads(:affinitymask)
    # pinthreads(0:63) #has the same effect
else
    pinthreads(:affinitymask)
    # pinthreads(64:127) #has the same effect
end
.... more code

Some portions of the code show better performance, but not all.
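For completeness, here is the fuller pinning setup I have been trying, which pins both the Julia threads and the OpenBLAS threads per MPI rank (a sketch: the `0:63` / `64:127` ranges are specific to my `--ntasks-per-node=2 --cpus-per-task=64` layout, I read the rank from `SLURM_PROCID` to keep the example self-contained, and `openblas_pinthreads` is my reading of the ThreadPinning.jl BLAS-pinning API):

```julia
using ThreadPinning

# SLURM sets SLURM_PROCID per task; defaults to 0 outside a job.
rank = parse(Int, get(ENV, "SLURM_PROCID", "0"))

# Each rank owns one contiguous block of 64 physical cores on this
# node layout; adjust the ranges for other task/CPU configurations.
mycpus = rank == 0 ? (0:63) : (64:127)

pinthreads(mycpus)           # pin the Julia threads to explicit CPU IDs
openblas_pinthreads(mycpus)  # pin the OpenBLAS threads to the same cores
```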

My real question is:

Have others encountered performance issues when running Julia threads in a SLURM environment?

If it’s helpful, I can code up a small script that demonstrates the various types of operations being performed. In general, though, the main bottlenecks are (a) some multithreaded BLAS calls (run with `BLAS.set_num_threads(64)`) and (b) some threaded loops that each call single-threaded BLAS (`BLAS.set_num_threads(1)`).
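As a first approximation, the two patterns look roughly like this (a self-contained sketch with made-up matrix sizes; only the standard library's `LinearAlgebra` and `Threads` are used):

```julia
using LinearAlgebra

# Pattern (a): a few large matrix multiplications on multithreaded BLAS.
BLAS.set_num_threads(64)
A = rand(2000, 2000); B = rand(2000, 2000)
C1 = A * B

# Pattern (b): many small independent multiplications, each running
# single-threaded BLAS, parallelized across Julia threads instead.
BLAS.set_num_threads(1)
blocks = [rand(200, 200) for _ in 1:4Threads.nthreads()]
out = Vector{Matrix{Float64}}(undef, length(blocks))
Threads.@threads for i in eachindex(blocks)
    out[i] = blocks[i] * blocks[i]
end
```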

I suspect @carstenbauer would be the most knowledgeable (also thank you for all your wonderful work on ThreadPinning.jl and the documentation you have on your wiki)

Can you elaborate on which portions do better and which don’t? Does completely disabling BLAS parallelism help? Have you seen Pinning BLAS Threads · ThreadPinning.jl?

I did play with pinning the BLAS threads a bit but didn’t see a noticeable improvement.

It was the multithreaded BLAS calls that didn’t improve, if I recall correctly. Disabling BLAS multithreading entirely doesn’t help, and isn’t a viable solution anyway because I rely on it for a couple of matrix / LAPACK operations. I will experiment with that more and report what I see, though.

I think to get to the bottom of this I will have to post an example script that replicates what I am seeing, separating the essentials of the bottleneck portions from the complexity of the full code; those portions can be replicated in spirit pretty easily.