BLAS vs Threads on a cluster

SpuriousEigenstate · April 22, 2024, 11:47am

I’m developing a script which, at its core, is mainly diagonalizing a large number of Hermitian matrices, i.e., eigen(Hermitian(A)). I want to run the code on a cluster (on a single machine, 96 threads) and am looking into multithreading it using Threads.@threads.

In testing on my desktop (16 threads), I noticed that the LinearAlgebra functions automatically make use of multiple threads. So my question is: is it better (faster) to let all the threads be used by eigen, or to restrict eigen to a single thread and parallelize using Threads? Or somewhere in between?

Oscar_Smith · April 22, 2024, 12:09pm

Generally you want to set BLAS to a single thread if you can easily parallelize at a higher level. BLAS multithreading doesn’t give you perfect scaling.
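A minimal sketch of that pattern, assuming the matrices are independent of each other. The helper names `random_hermitian` and `diagonalize_all` and the sizes are made up for illustration, not taken from the original script:

```julia
using LinearAlgebra

# Restrict BLAS/LAPACK to one thread; the parallelism comes from Julia threads instead.
BLAS.set_num_threads(1)

# Hypothetical helper: build a random n×n complex Hermitian matrix.
function random_hermitian(n)
    B = rand(ComplexF64, n, n)
    return Hermitian(B + B')
end

# Hypothetical helper: diagonalize every matrix, one loop iteration per matrix.
function diagonalize_all(mats)
    # One result slot per matrix, so no two iterations write to the same location.
    spectra = Vector{Vector{Float64}}(undef, length(mats))
    Threads.@threads for i in eachindex(mats)
        spectra[i] = eigen(mats[i]).values
    end
    return spectra
end

mats = [random_hermitian(50) for _ in 1:8]
spectra = diagonalize_all(mats)
```

Start Julia with `julia -t N` (or set `JULIA_NUM_THREADS`) so that the loop actually has several threads to spread over.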

SpuriousEigenstate · April 22, 2024, 12:29pm

Thanks! A simple test shows that you are right:

using LinearAlgebra

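A = rand(800, 800)

# Default BLAS thread count: 30 diagonalizations spread over the Julia threads
@time Threads.@threads for i in 1:30
    eigen(A)
end

# Same loop with BLAS restricted to a single thread
BLAS.set_num_threads(1)
@time Threads.@threads for i in 1:30
    eigen(A)
end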

output:

 12.973553 seconds (31.24 k allocations: 613.706 MiB, 0.16% gc time, 3.23% compilation time)
  5.900912 seconds (22.85 k allocations: 613.151 MiB, 0.23% gc time, 9.74% compilation time)
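Both timings above include some compilation, and `A` is not wrapped in `Hermitian`, so the numbers are only indicative. A slightly more careful sketch with a warm-up call and the `Hermitian` path from the original question follows. `time_batch` is a made-up helper, and the timings will differ by machine and BLAS library:

```julia
using LinearAlgebra

# Hypothetical helper: time `reps` threaded diagonalizations of `H`
# with the given number of BLAS threads and return the elapsed seconds.
function time_batch(H, reps, blas_threads)
    BLAS.set_num_threads(blas_threads)
    eigen(H)  # warm-up call so that `eigen` is already compiled before timing
    return @elapsed Threads.@threads for i in 1:reps
        eigen(H)
    end
end

B = rand(800, 800)
H = Hermitian(B + B')

default_threads = BLAS.get_num_threads()
t_multi = time_batch(H, 30, default_threads)
t_single = time_batch(H, 30, 1)

println("BLAS threads = $default_threads: $t_multi s")
println("BLAS threads = 1: $t_single s")
```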

abraemer · April 23, 2024, 5:18am

Also note that there is an interaction between BLAS threads and Julia threads that depends on whether you use MKL or the default OpenBLAS. In short:

with MKL: total threads used = Julia threads × BLAS threads
with OpenBLAS: total threads used = Julia threads + BLAS threads (BLAS will compute “jobs” sequentially but use threads within each “job”)

A more detailed explanation can be found here:

(Also, using ThreadPinning.jl might improve performance if you run on a cluster.)
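To see which of the two cases applies on a given machine, something like the following should do. It is only a sketch, and `BLAS.get_config()` assumes Julia 1.7 or newer:

```julia
using LinearAlgebra

# Number of Julia threads (set with `julia -t N` or JULIA_NUM_THREADS).
println("Julia threads: ", Threads.nthreads())

# Number of threads the loaded BLAS library is currently set to use.
println("BLAS threads:  ", BLAS.get_num_threads())

# Shows which BLAS backend is loaded (OpenBLAS by default, MKL after `using MKL`).
println(BLAS.get_config())
```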

gdalle · April 23, 2024, 6:42am

When I wrote the docs section about this I buried it at the end of the performance tips, but maybe there is a better place?

https://docs.julialang.org/en/v1/manual/performance-tips/#man-multithreading-linear-algebra

abraemer · April 23, 2024, 7:20am

I think the Performance tips section is a reasonable place in principle. I usually link to the explanation in ThreadPinning.jl because it is a bit more detailed and also highlights the different behavior of MKL.jl (which I deem a very surprising footgun…).

Maybe a lesson here could be that the Performance Tips section got a bit too long to be unstructured? Maybe we could make some sections like “Type related stuff”, “Function related stuff”, “Numerics”, “Miscellaneous”. Maybe lead with a section “Most common and severe performance pitfalls”.

For the specific performance tip about OpenBLAS, I think a bit more emphasis on the (to me, at least, surprising) behavior of OPENBLAS_NUM_THREADS=N>1 would be good. The current “There is just one OpenBLAS thread pool shared among all Julia threads.” does not scream to me “your operations will essentially be done serially (but a bit faster)”.

gdalle · April 23, 2024, 8:06am

I’ve been wanting to do that for a while, but the Documenter format doesn’t make such subsections visible, so I’m not sure how to proceed.

Feel free to submit a PR. I didn’t write this part because I was an expert, only because I was frustrated not to find it anywhere official ^^
