Multiple BLISBLAS Threads on Workers

I am refining a BLAS-intensive pipeline and have loved (and gotten a nice speedup from) BLISBLAS.jl. I happen to be running on a cluster with hyperthreading. Currently, I am launching workers with addprocs(SlurmClusterManager()) across multiple nodes. Reading the ThreadPinning.jl docs:

If OPENBLAS_NUM_THREADS=1, OpenBLAS uses the calling Julia thread(s) to run BLAS computations, i.e. it “reuses” the Julia thread that runs a computation.

First, is that also true for BLIS_NUM_THREADS? And, if I wanted to take advantage of the hyperthreading, would I want to launch each worker with two Julia threads and set BLIS_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1?
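Concretely, I was imagining a launch along these lines; the exeflags keyword and relying on srun exporting the environment to the workers are assumptions on my part, not something I have verified with SlurmClusterManager:

using Distributed, SlurmClusterManager

# Assumptions: srun exports these environment variables to the workers,
# and the cluster manager forwards exeflags so each worker gets two Julia threads.
ENV["BLIS_NUM_THREADS"] = "1"
ENV["OPENBLAS_NUM_THREADS"] = "1"
addprocs(SlurmClusterManager(); exeflags="-t 2")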

Currently, I set the worker → CPU mapping manually as below. Are there any obvious ways to adapt that to the case I am proposing, where we have two threads per worker?

using Distributed, DataFrames, ThreadPinning
using ParallelDataTransfer   # provides @getfrom and sendto

# Collect (worker id, current CPU, hostname) from every worker.
getinfo_worker(workerid::Int) = @getfrom workerid myid(), ThreadPinning.sched_getcpu(), gethostname()
idlst = getinfo_worker.(workers())
df = DataFrame(workerid=Int[], physcpu=Int[], hostname=String[])
push!(df, idlst...)

# Group workers by node and pin each one to a distinct CPU on its node.
gdf = groupby(df, :hostname)
for sgdf in gdf, (sindx, sworker) in enumerate(sgdf.workerid)
    sendto(sworker, sindx=sindx)
    @spawnat sworker ThreadPinning.pinthread(sindx-1)
end

I don’t know, but it would certainly be great to figure this out and add it to the docs you mentioned.

That largely depends on the application specifics. In general, I tend to avoid using hyperthreads. They can give a speedup for certain workloads, but they may also have the opposite effect and add quite a bit of complexity.

IIUC, you would benefit from "MPI: Improve manual pinning (`pinthreads_mpi`)" (Issue #61 in carstenbauer/ThreadPinning.jl on GitHub), if generalized beyond MPI. It’s not high on my priority list but will hopefully happen at some point. Feel free to move this forward yourself if you want to.

Thanks for the reply (and the nice packages). I also tend to avoid using hyperthreads, but I just wanted to do the test to see if it might help in this case.

I tried using julia -t 2 with BLIS_NUM_THREADS=1 and with BLIS_NUM_THREADS=2, and saw no change in htop or in timing compared to running Julia single-threaded. I also set the thread count via BLIS.set_num_threads(), with the same result (even though get_num_threads() clearly returned the value I had set).

I was running the following test, which is close to the limiting step in my problem. I also tried a 5000×8000 matrix in case the size was too small for BLISBLAS to use an additional available thread (again, without any change in behavior).

using Random, LinearAlgebra, BenchmarkTools
using BLISBLAS   # makes BLIS the active BLAS backend

rng = MersenneTwister(203);
A = randn(rng, 50, 8000);
x = randn(rng, 8000);

# Matrix-vector product only; the result is deliberately discarded.
function testMult(A, x)
    A*x
    return
end

b = @benchmark testMult($A, $x)
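In case it is useful for anyone repeating this: one way to sanity-check which backend is actually active is plain libblastrampoline introspection (standard LinearAlgebra functionality; whether the reported thread count reflects BLIS’s own setting is an assumption on my part):

using LinearAlgebra
BLAS.get_config()        # should list the BLIS library when BLISBLAS is loaded
BLAS.get_num_threads()   # thread count as seen through libblastrampoline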

I am very doubtful that this will be helpful to my application, since the same multithreaded-vs-single-threaded comparison with OpenBLAS on this test problem came out much worse. However, I wanted to learn how to pin the threads on workers manually across nodes… so I will just put my solution here (which is VERY manual), in case it helps someone else.

using Distributed, DataFrames, ThreadPinning
using ParallelDataTransfer   # provides @getfrom and sendto

getinfo_worker(workerid::Int) = @getfrom workerid myid(), ThreadPinning.sched_getcpu(), gethostname()
idlst = getinfo_worker.(workers())
df = DataFrame(workerid=Int[], physcpu=Int[], hostname=String[])
push!(df, idlst...)
gdf = groupby(df, :hostname)

# Pin the main process to CPU 0, then pin each worker's two Julia threads to a
# physical core and its hyperthread sibling (offset by the number of physical cores).
@spawnat 1 ThreadPinning.pinthread(0)
for sgdf in gdf
    pcores = length(sgdf.workerid)
    for (sindx, sworker) in enumerate(sgdf.workerid)
        sendto(sworker, sindx=sindx)
        @spawnat sworker ThreadPinning.pinthreads([sindx-1, sindx-1+pcores])
    end
end
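To double-check the result afterwards, the pinning can be queried back from each worker; I believe getcpuids is the ThreadPinning function for this, but treat that (and the exact output) as an assumption for eyeballing only:

# Query the CPU IDs of every Julia thread on each worker after pinning.
for w in workers()
    cpus = @getfrom w ThreadPinning.getcpuids()
    host = @getfrom w gethostname()
    println("worker $w on $host -> CPUs $cpus")
end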

Julia and BLAS threads are currently distinct, unfortunately. That is likely to change in the future.

Currently, this means that BLAS won’t use Julia’s threads.
You could, however, use Julia’s threads to execute multiple BLAS calls in parallel.