Multiple BLISBLAS Threads on Workers

andrew-saydjari · April 9, 2023, 6:05am

I am refining a BLAS-intensive pipeline and have loved (and gotten a nice speedup from) BLISBLAS.jl. I happen to be running on a cluster with hyperthreading. Currently, I launch workers with addprocs(SlurmClusterManager()) across multiple nodes. The ThreadPinning.jl docs say:

If OPENBLAS_NUM_THREADS=1, OpenBLAS uses the calling Julia thread(s) to run BLAS computations, i.e. it “reuses” the Julia thread that runs a computation.

First, is that also true for BLIS_NUM_THREADS? And, if I wanted to take advantage of the hyperthreading, would I want to launch each worker with two julia threads, and set BLIS_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1?
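For reference, the setup proposed here (two Julia threads per worker, one BLAS thread each) could be launched roughly as follows. This is only a sketch: it uses two local workers as a stand-in for SlurmClusterManager(), and whether that cluster manager forwards the `exeflags` and `env` keywords in the same way is an assumption to verify. It also says nothing about whether BLIS actually reuses the calling Julia thread.

```julia
using Distributed

# Two local workers stand in for the Slurm-managed ones (assumption for illustration).
pids = addprocs(2;
    exeflags = `--threads=2`,                      # two Julia threads per worker
    env = ["BLIS_NUM_THREADS" => "1",              # one BLAS thread per library
           "OPENBLAS_NUM_THREADS" => "1"])

# Ask every worker what it actually sees.
results = map(pids) do w
    remotecall_fetch(w) do
        (Threads.nthreads(),
         get(ENV, "BLIS_NUM_THREADS", ""),
         get(ENV, "OPENBLAS_NUM_THREADS", ""))
    end
end

rmprocs(pids)
```

Each entry of `results` is a tuple of the worker's Julia thread count and the two environment variables, which makes it easy to confirm the settings reached the workers before benchmarking.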

Currently, I set the worker → CPU mapping manually, as below. Are there any obvious ways to adapt that to the case I am proposing, where we have two threads per worker?

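# @getfrom and sendto are presumably from ParallelDataTransfer.jl; ThreadPinning must also be loaded on the workers
using Distributed, DataFrames, ThreadPinning, ParallelDataTransfer

# Ask each worker for its ID, the CPU it is currently running on, and its hostname
getinfo_worker(workerid::Int) = @getfrom workerid myid(), ThreadPinning.sched_getcpu(), gethostname()
idlst = getinfo_worker.(workers())
df = DataFrame(workerid=Int[], physcpu=Int[], hostname=String[])
push!(df, idlst...)
gdf = groupby(df, :hostname)
# On each host, pin the i-th worker to CPU i-1
for sgdf in gdf, (sindx, sworker) in enumerate(sgdf.workerid)
    sendto(sworker, sindx=sindx)
    @spawnat sworker ThreadPinning.pinthread(sindx-1)
end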

carstenbauer · April 9, 2023, 9:13am

I don’t know whether the same holds for BLIS_NUM_THREADS, but it would certainly be great to figure this out and add it to the docs you mentioned.

Whether hyperthreading helps largely depends on the application specifics. In general, I tend to avoid using hyperthreads. They can give a speedup for certain workloads, but they can also slow things down, and they add quite a bit of complexity.

IIUC, you would benefit from ThreadPinning.jl issue #61, "MPI: Improve "manual" pinning (`pinthreads_mpi`)", if it were generalized beyond MPI. It’s not high on my priority list but will hopefully happen at some point. Feel free to move this forward yourself if you want to.

andrew-saydjari · April 9, 2023, 9:39pm

Thanks for the reply (and the nice packages). I also tend to avoid using hyperthreads, but I just wanted to do the test to see if it might help in this case.

I tried using julia -t 2 with BLIS_NUM_THREADS=1 and with BLIS_NUM_THREADS=2, and saw no change in htop or in timing compared to running Julia single-threaded. I also did this via BLIS.set_num_threads() with the same result (despite the fact that get_num_threads() clearly returned the value that I set).

I was running the following test, which is close to the limiting step in my problem. I also tried a 5000×8000 matrix in case the size was too small to trigger BLISBLAS to use an additional available thread (again, without any change in behavior).

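using Random, BenchmarkTools  # MersenneTwister and @benchmark; BLISBLAS is assumed to be loaded already

rng = MersenneTwister(203);
A = randn(rng,50,8000);
x = randn(rng,8000);

# Matrix-vector product (a BLAS gemv call); the result is discarded
function testMult(A,x)
    A*x
    return
end

b = @benchmark testMult($A,$x)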
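A dependency-free way to check whether the BLAS thread count changes the timing of this kind of product is sketched below. It is not the benchmark above: it uses only the standard library, a hypothetical helper `time_gemv!`, and `LinearAlgebra.BLAS.set_num_threads`, which controls whichever backend is currently active (OpenBLAS by default). Whether that call reaches BLIS once BLISBLAS is loaded is an assumption to check.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(203)
A = randn(rng, 5000, 8000)
x = randn(rng, 8000)
y = zeros(5000)

# Hypothetical helper: best-of-`reps` wall time of an in-place matrix-vector product.
function time_gemv!(y, A, x; reps=20)
    mul!(y, A, x)  # warm-up call so compilation is not timed
    return minimum(@elapsed(mul!(y, A, x)) for _ in 1:reps)
end

timings = Dict{Int,Float64}()
for n in (1, 2)
    BLAS.set_num_threads(n)        # request n BLAS threads from the active backend
    timings[n] = time_gemv!(y, A, x)
end
```

If `timings[1]` and `timings[2]` are essentially equal, the backend is either not threading this operation or not honoring the requested thread count.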

I am very doubtful that this will be helpful to my application since the comparison with multithreading/not multithreading with OpenBLAS on this test problem was much worse. However, I wanted to learn how to pin the threads on workers manually across nodes… and so I will just put my solution here (which is VERY manual), in case it helps someone else.

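# Ask each worker for its ID, the CPU it is currently running on, and its hostname
getinfo_worker(workerid::Int) = @getfrom workerid myid(), ThreadPinning.sched_getcpu(), gethostname()
idlst = getinfo_worker.(workers())
df = DataFrame(workerid=Int[], physcpu=Int[], hostname=String[])
push!(df, idlst...)
gdf = groupby(df, :hostname)
# Pin the main process to CPU 0
@spawnat 1 ThreadPinning.pinthread(0)
for sgdf in gdf
    pcores = length(sgdf.workerid)  # workers on this host, used as the offset to the second hyperthread
    for (sindx, sworker) in enumerate(sgdf.workerid)
        sendto(sworker, sindx=sindx)
        # Pin the worker's two threads to CPU i-1 and CPU i-1+pcores
        @spawnat sworker ThreadPinning.pinthreads([sindx-1,sindx-1+pcores])
    end
end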
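The CPU IDs chosen in the loop above can be worked through on a small example. The sketch below uses a hypothetical helper, `worker_cpuids`, and assumes the common Linux numbering in which logical CPUs k and k + ncores are the two hyperthreads of physical core k. That numbering is not universal, so it is worth checking with `lscpu -e` on the actual nodes.

```julia
# Hypothetical helper: CPU IDs for the two threads of the `sindx`-th worker on a node,
# assuming logical CPUs k and k + ncores are hyperthread siblings on physical core k.
worker_cpuids(sindx::Int, ncores::Int) = [sindx - 1, sindx - 1 + ncores]

ncores = 4  # assumed number of physical cores (and workers) on the node
mapping = [worker_cpuids(i, ncores) for i in 1:ncores]
# mapping == [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Note that the code above takes the offset from the number of workers on the host, so the two pinned CPUs are siblings only if one worker is launched per physical core.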

Elrod · April 10, 2023, 12:47am

Julia and BLAS threads are currently distinct, unfortunately. That is likely to change in the future.

Currently, this means that BLAS won’t use Julia’s threads.
You could, however, use Julia’s threads to execute multiple BLAS calls in parallel.
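A minimal sketch of that last suggestion, assuming Julia was started with several threads (e.g. julia -t 2) and that the work splits into independent products; the batch of vectors `xs` is made up for illustration. `BLAS.set_num_threads` acts on the active backend, and with BLISBLAS loaded the thread count would presumably be set through its own setter instead.

```julia
using LinearAlgebra, Random

BLAS.set_num_threads(1)  # one BLAS thread per call; parallelism comes from Julia threads

rng = MersenneTwister(203)
A = randn(rng, 50, 8000)
xs = [randn(rng, 8000) for _ in 1:16]  # hypothetical batch of independent input vectors
ys = [zeros(50) for _ in xs]           # one preallocated output per input

# Each iteration issues its own BLAS call; iterations are spread over the Julia threads.
Threads.@threads for i in eachindex(xs)
    mul!(ys[i], A, xs[i])
end
```

Every iteration writes only to its own `ys[i]`, so no locking is needed. With a single Julia thread the loop still runs, just serially.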
