I am refining a BLAS-intensive pipeline and have loved (and gotten a nice speedup from) BLISBLAS.jl. I happen to be running on a cluster with hyperthreading. Currently, I launch workers with addprocs(SlurmClusterManager()) across multiple nodes. The ThreadPinning.jl docs say:
If OPENBLAS_NUM_THREADS=1, OpenBLAS uses the calling Julia thread(s) to run BLAS computations, i.e. it “reuses” the Julia thread that runs a computation.
First, is that also true for BLIS_NUM_THREADS? And, if I wanted to take advantage of the hyperthreading, would I want to launch each worker with two Julia threads, and set BLIS_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1?
Currently, I set the worker → CPU mapping manually as below. Is there an obvious way to adapt that to the case I am proposing, with two threads per worker?
getinfo_worker(workerid::Int) = @getfrom workerid myid(), ThreadPinning.sched_getcpu(), gethostname()
idlst = getinfo_worker.(workers())
df = DataFrame(workerid=Int[],physcpu=Int[],hostname=String[])
push!(df,idlst...)
gdf = groupby(df,:hostname)
for sgdf in gdf, (sindx, sworker) in enumerate(sgdf.workerid)
    sendto(sworker, sindx=sindx)
    @spawnat sworker ThreadPinning.pinthread(sindx-1)
end
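As a quick check before involving Slurm at all, the two-threads-per-worker setup asked about above can be tried locally. The following is a minimal sketch using only the Distributed and LinearAlgebra standard libraries: it launches one local worker with two Julia threads and reports what that worker actually sees. The local `addprocs(1; ...)` call is a stand-in for `addprocs(SlurmClusterManager())`; I would expect the same `exeflags` keyword to pass through to the Slurm manager, but that is an assumption I have not verified.

```julia
using Distributed
using LinearAlgebra

# Launch one local worker with two Julia threads.
# (Local stand-in for the Slurm manager; assumption: exeflags passes through there too.)
new_workers = addprocs(1; exeflags="--threads=2")
@everywhere using LinearAlgebra

# Ask each new worker how many Julia threads and BLAS threads it has.
for w in new_workers
    nthr = remotecall_fetch(Threads.nthreads, w)
    nblas = remotecall_fetch(BLAS.get_num_threads, w)
    println("worker $w: Julia threads = $nthr, BLAS threads = $nblas")
end
```

If the reported thread counts are not what you intended, there is little point in tuning the pinning yet.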
I don’t know, but it would certainly be great to figure this out and add it to the docs you mentioned.
That largely depends on the specifics of the application. In general, I tend to avoid hyperthreads: they can give a speedup for certain workloads, but they can also slow things down, and they add quite a bit of complexity.
Thanks for the reply (and the nice packages). I also tend to avoid using hyperthreads, but I just wanted to do the test to see if it might help in this case.
I tried julia -t 2 with both BLIS_NUM_THREADS=1 and BLIS_NUM_THREADS=2 and saw no change in htop or in timing compared to running Julia single-threaded. I also set the thread count via BLIS.set_num_threads(), with the same result (even though get_num_threads() clearly returned the value that I had set).
I was running the following test, which is close to the limiting step in my problem. I also tried a 5000×8000 matrix in case the original size was too small for BLISBLAS to use an additional available thread (again, without any change in behavior).
using Random, BenchmarkTools
rng = MersenneTwister(203);
A = randn(rng,50,8000);
x = randn(rng,8000);
function testMult(A,x)
    A*x
    return
end
b = @benchmark testMult($A,$x)
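To separate "the thread setting is not taking effect" from "this operation does not benefit from threads", it can help to time the same product under different BLAS thread counts and to print which backend is actually loaded. The sketch below uses only the LinearAlgebra standard library (`BLAS.get_config()` needs Julia 1.7 or later); `time_matvec` is a hypothetical helper name, not part of any package. With BLISBLAS loaded, I would expect BLIS to show up in the list of loaded libraries, but whether `BLAS.set_num_threads` reaches BLIS is an assumption worth checking against the BLISBLAS.jl README. Note also that matrix-vector products are usually limited by memory bandwidth, so extra threads may not help regardless of backend.

```julia
using LinearAlgebra
using Random

# Hypothetical helper: average time of one matrix-vector product for a given
# BLAS thread count. The previous thread setting is restored afterwards.
function time_matvec(A, x, nthreads::Integer; reps::Integer=200)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nthreads)
    try
        y = A * x                      # warm-up, also triggers compilation
        t = @elapsed for _ in 1:reps
            mul!(y, A, x)              # in-place product, no allocation
        end
        return t / reps
    finally
        BLAS.set_num_threads(old)
    end
end

rng = MersenneTwister(203)
A = randn(rng, 50, 8000)
x = randn(rng, 8000)

# Which BLAS libraries libblastrampoline is currently forwarding to.
println([basename(lib.libname) for lib in BLAS.get_config().loaded_libs])

for n in (1, 2)
    println("BLAS threads = $n: ", time_matvec(A, x, n), " s per product")
end
```

Running the same loop with two large square matrices (a matrix-matrix product) is a useful contrast: if that case speeds up with more threads and the matrix-vector case does not, the thread setting is working and the operation itself is the limit.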
I am very doubtful that this will help my application, since the multithreaded versus single-threaded comparison with OpenBLAS on this test problem was much worse. However, I wanted to learn how to pin the threads on workers manually across nodes… and so I will just put my solution here (which is VERY manual), in case it helps someone else.
getinfo_worker(workerid::Int) = @getfrom workerid myid(), ThreadPinning.sched_getcpu(), gethostname()
idlst = getinfo_worker.(workers())
df = DataFrame(workerid=Int[],physcpu=Int[],hostname=String[])
push!(df,idlst...)
gdf = groupby(df,:hostname)
# pin the master process to CPU 0
@spawnat 1 ThreadPinning.pinthread(0)
for sgdf in gdf
    # number of workers on this host (assumes one worker per physical core)
    pcores = length(sgdf.workerid)
    for (sindx, sworker) in enumerate(sgdf.workerid)
        sendto(sworker, sindx=sindx)
        # pin the worker's two threads to CPU IDs sindx-1 and sindx-1+pcores
        @spawnat sworker ThreadPinning.pinthreads([sindx-1,sindx-1+pcores])
    end
end
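The mapping used above can be pulled out into a pure function, which makes it easy to inspect the plan before pinning anything. This is a minimal sketch with no package dependencies; `cpu_assignments` is a hypothetical name, and it bakes in the same assumptions as the loop above: one worker per physical core on each host, and the second hardware thread of core i numbered i + pcores. That numbering varies between machines, so it is worth confirming with ThreadPinning's threadinfo() or lscpu first.

```julia
# Hypothetical helper: on each host, the k-th worker (k = 1, 2, ...) is assigned
# the CPU IDs [k-1, k-1+pcores], where pcores is the number of workers on that host.
# Assumption: CPU i and CPU i+pcores are the two hardware threads of one core.
function cpu_assignments(workerids::AbstractVector{<:Integer},
                         hostnames::AbstractVector{<:AbstractString})
    length(workerids) == length(hostnames) ||
        throw(ArgumentError("workerids and hostnames must have equal length"))
    assignments = Dict{Int,Vector{Int}}()
    for host in unique(hostnames)
        on_host = workerids[hostnames .== host]   # workers running on this host
        pcores = length(on_host)
        for (k, w) in enumerate(on_host)
            assignments[w] = [k - 1, k - 1 + pcores]
        end
    end
    return assignments
end

# Example with made-up worker IDs and hostnames: two workers on each of two nodes.
plan = cpu_assignments([2, 3, 4, 5], ["node1", "node1", "node2", "node2"])
for w in sort(collect(keys(plan)))
    println("worker $w -> CPUs $(plan[w])")
end
```

With the plan in hand, the pinning step reduces to calling ThreadPinning.pinthreads(plan[w]) on each worker w via @spawnat, as in the loop above.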