I am benchmarking a code that uses:

- Julia multithreading
- multithreaded BLAS
- single-threaded BLAS called in parallel from Julia threads

on NERSC Perlmutter CPU nodes (2x AMD EPYC 7763, 64 cores per socket, 2 hardware threads per core; Cray MPICH).
From my testing, I am seeing the following. I request a job with:

```
salloc --nodes 1 --qos interactive --time 04:00:00 --constraint "cpu" --account=$ACCT_NUM --hint=nomultithread --ntasks-per-node=2 --cpus-per-task=64 --exclusive
```
When I then use ThreadPinning.jl to check which CPUs are being used:

```julia
using ThreadPinning
println(ThreadPinning.getcpuids())
```
the CPU IDs returned are a mix of the first hardware threads of the physical cores (0-63) and their hyperthread siblings (64-127).
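For anyone reproducing this, ThreadPinning.jl also exports `threadinfo()`, which prints a color-coded overview of where each Julia thread currently sits (the exact output depends on the node and Julia invocation):

```julia
using ThreadPinning

# Visual overview of the thread-to-CPU mapping; hyperthreads
# are highlighted separately from the first hardware threads.
threadinfo()

# Raw CPU IDs, one per Julia thread.
println(getcpuids())
```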
When I pin the threads explicitly, for example:

```julia
if rank == 0
    pinthreads(:affinitymask)
    # pinthreads(0:63)   # has the same effect
else
    pinthreads(:affinitymask)
    # pinthreads(64:127) # has the same effect
end
# ... more code
```
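As an aside, the rank-dependent branch above can be collapsed into a single expression. A minimal sketch, assuming one MPI rank per 64-core block and the CPU-ID ranges from my comments above, using MPI.jl's standard API:

```julia
using MPI, ThreadPinning

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

# One rank per 64-CPU block: rank 0 gets CPUs 0-63, rank 1 gets 64-127.
# With --ntasks-per-node=2 --cpus-per-task=64, this should match what
# pinthreads(:affinitymask) derives from the SLURM-provided mask.
first_cpu = rank * 64
pinthreads(first_cpu:(first_cpu + 63))
```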
Some portions of the code show better performance, but not all.
My real question is:
Have others encountered performance issues when running Julia threads in a SLURM environment?
If it’s helpful, I can try to code up a small script that exercises the various types of operations being performed. But in general, the main bottlenecks are some multithreaded BLAS calls (i.e. `BLAS.set_num_threads(64)`) and some threaded loops that call single-threaded BLAS (`BLAS.set_num_threads(1)`).
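For the record, the pattern I mean looks roughly like this (a minimal, self-contained sketch using only the LinearAlgebra stdlib; `n` and the matrices are placeholders, not the real workload):

```julia
using LinearAlgebra

n = 256
A, B = randn(n, n), randn(n, n)

# Phase 1: a few large multithreaded BLAS calls.
BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2)  # e.g. 64 on a Perlmutter socket
C = A * B

# Phase 2: a threaded loop over many small single-threaded BLAS calls.
BLAS.set_num_threads(1)
norms = Vector{Float64}(undef, n)
Threads.@threads for i in 1:n
    norms[i] = norm(view(C, :, i))  # small BLAS-1 call per column
end
```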
I suspect @carstenbauer would be the most knowledgeable (also thank you for all your wonderful work on ThreadPinning.jl and the documentation you have on your wiki)