BLAS vs Threads on a cluster

I’m developing a script that, at its core, diagonalizes a large number of Hermitian matrices, i.e., it calls eigen(Hermitian(A)) many times. I want to run the code on a cluster (on a single machine, 96 threads) and am looking into multithreading it using Threads.@threads.

In testing on my desktop (16 threads), I noticed that the LinearAlgebra functions automatically make use of multiple threads. So my question is: is it better (faster) to let all the threads be used by eigen, or to restrict eigen to a single thread and parallelize using Threads? Or somewhere in between?

Generally you want to set BLAS to a single thread if you can easily parallelize at a higher level. BLAS multithreading doesn’t give you perfect scaling.

Thanks! A simple test shows that you are right:

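using LinearAlgebra

A = rand(800,800)

# Default BLAS thread count: BLAS may use several threads inside each eigen call
@time Threads.@threads for i in 1:30
    eigen(A)
end

# One BLAS thread per call: parallelism comes only from Threads.@threads
BLAS.set_num_threads(1)
@time Threads.@threads for i in 1:30
    eigen(A)
end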

output:

 12.973553 seconds (31.24 k allocations: 613.706 MiB, 0.16% gc time, 3.23% compilation time)
  5.900912 seconds (22.85 k allocations: 613.151 MiB, 0.23% gc time, 9.74% compilation time)
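
For the use case in the original question (many different Hermitian matrices rather than one matrix repeated), the same idea could be sketched as follows. This is only a minimal sketch: the helper names `random_hermitian` and `diagonalize_all` are hypothetical, the matrices are random stand-ins for the real data, and it stores only the eigenvalues to keep the result type simple.

```julia
using LinearAlgebra

# One BLAS thread per call, so that Julia threads provide the parallelism.
BLAS.set_num_threads(1)

# Hypothetical helper: build a random n×n complex Hermitian matrix.
function random_hermitian(n)
    B = rand(ComplexF64, n, n)
    return Hermitian(B + B')
end

# Hypothetical helper: compute the eigenvalues of every matrix in `mats`,
# distributing the matrices over the available Julia threads.
function diagonalize_all(mats)
    results = Vector{Vector{Float64}}(undef, length(mats))
    Threads.@threads for i in eachindex(mats)
        # Each iteration writes only to its own slot, so no locking is needed.
        results[i] = eigvals(mats[i])
    end
    return results
end

mats = [random_hermitian(200) for _ in 1:32]
vals = diagonalize_all(mats)
```

Start Julia with several threads (for example `julia -t 16`) for the loop to actually run in parallel; with a single Julia thread it simply runs serially.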

Also note that BLAS threads and Julia threads interact differently depending on whether you use MKL or the default OpenBLAS. In short:

  • with MKL: total threads used = Julia threads × BLAS threads
  • with OpenBLAS: total threads used = Julia threads + BLAS threads (BLAS will compute “jobs” sequentially but use threads within each “job”)
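
To see which of the two cases applies on a given machine, one can print the relevant settings before running anything heavy. A minimal sketch, assuming Julia 1.7 or newer (where `BLAS.get_config()` is available); the variable names are only illustrative:

```julia
using LinearAlgebra

# Number of Julia threads (fixed at startup via `julia -t N` or JULIA_NUM_THREADS).
julia_threads = Threads.nthreads()

# Number of threads the currently loaded BLAS library reports.
blas_threads = BLAS.get_num_threads()

# Which BLAS backend is loaded: OpenBLAS by default, MKL after `using MKL`.
# Assumption: Julia >= 1.7, where BLAS.get_config() exists.
backend = BLAS.get_config()

println("Julia threads: ", julia_threads)
println("BLAS threads:  ", blas_threads)
println("BLAS backend:  ", backend)
```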

A more detailed explanation can be found here:

(Also, using ThreadPinning.jl might improve performance if you run on a cluster.)

When I wrote the docs section about this I buried it at the end of the performance tips, but maybe there is a better place?

https://docs.julialang.org/en/v1/manual/performance-tips/#man-multithreading-linear-algebra

I think the Performance tips section is a reasonable place in principle. I usually link to the explanation in ThreadPinning.jl because it is a bit more detailed and also highlights the different behavior of MKL.jl (which I deem a very surprising footgun…).

Maybe a lesson here could be that the Performance Tips section got a bit too long to be unstructured? Maybe we could make some sections like “Type related stuff”, “Function related stuff”, “Numerics”, “Miscellaneous”. Maybe lead with a section “Most common and severe performance pitfalls”.

For the specific performance tip about OpenBLAS, I think a bit more emphasis on the (to me, at least, surprising) behavior of OPENBLAS_NUM_THREADS=N>1 would be good. The current “There is just one OpenBLAS thread pool shared among all Julia threads.” does not scream to me “your operations will essentially be done serially (but a bit faster)”.

I’ve been wanting to do that for a while, but the Documenter format doesn’t make such subsections visible, so I’m not sure how to proceed.

Feel free to submit a PR. I didn’t write this part because I was an expert, only because I was frustrated not to find it anywhere official ^^
