I’m developing a script which, at its core, is mainly diagonalizing a large number of Hermitian matrices, i.e., eigen(Hermitian(A)). I want to run the code on a cluster (on a single machine, 96 threads) and am looking into multithreading it using Threads.@threads.
In testing on my desktop (16 threads), I noticed that the LinearAlgebra functions automatically make use of multiple threads. So my question is: is it better (faster) to let all the threads be used by eigen, or to restrict eigen to a single thread and parallelize using Threads? Or somewhere in between?
Generally you want to set BLAS to a single thread if you can easily parallelize at a higher level; BLAS multithreading doesn’t give you perfect scaling.
using LinearAlgebra
A = rand(800, 800)
# OpenBLAS default: multithreaded BLAS inside each eigen call
@time Threads.@threads for i in 1:30
    eigen(A)
end
# restrict BLAS to one thread; parallelism now comes from Threads.@threads
BLAS.set_num_threads(1)
@time Threads.@threads for i in 1:30
    eigen(A)
end
output:
12.973553 seconds (31.24 k allocations: 613.706 MiB, 0.16% gc time, 3.23% compilation time)
5.900912 seconds (22.85 k allocations: 613.151 MiB, 0.23% gc time, 9.74% compilation time)
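Applied to the original Hermitian use case, a minimal sketch might look like the following (matrix size and count are placeholders, not from the benchmark above):

```julia
using LinearAlgebra

# Parallelize across matrices at the Julia level,
# so restrict OpenBLAS to a single thread per call.
BLAS.set_num_threads(1)

# Hypothetical workload: many independent Hermitian matrices.
mats = [Hermitian(rand(200, 200)) for _ in 1:8]

results = Vector{Eigen}(undef, length(mats))
Threads.@threads for i in eachindex(mats)
    # each eigen call runs single-threaded BLAS;
    # concurrency comes from the Julia threads
    results[i] = eigen(mats[i])
end
```

With 96 threads on the cluster node, one Hermitian diagonalization per Julia thread at a time should scale much better than letting OpenBLAS spread each factorization across all cores.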
I think the Performance tips section is a reasonable place in principle. I usually link to the explanation in ThreadPinning.jl because it is a bit more detailed and also highlights the different behavior of MKL.jl (which I deem a very surprising footgun…).
Maybe a lesson here could be that the Performance Tips section has gotten a bit too long to remain unstructured? Maybe we could make some sections like “Type-related stuff”, “Function-related stuff”, “Numerics”, “Miscellaneous”. Maybe lead with a section “Most common and severe performance pitfalls”.
For the specific performance tip about OpenBLAS, I think a bit more emphasis on the (to me surprising) behavior of OPENBLAS_NUM_THREADS=N>1 would be good. The current “There is just one OpenBLAS thread pool shared among all Julia threads.” does not scream to me “your operations will essentially be done serially (just each one a bit faster)”.
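A small illustration of the point above — a sketch assuming the default OpenBLAS backend (MKL.jl behaves differently, which is exactly the footgun mentioned earlier):

```julia
using LinearAlgebra

# With OpenBLAS, all Julia threads share ONE BLAS thread pool.
# If OPENBLAS_NUM_THREADS=N>1 and several Julia threads call into BLAS
# at the same time, those calls are effectively serialized against each
# other (each using N BLAS threads) instead of running concurrently.
@show Threads.nthreads()      # number of Julia-level threads
@show BLAS.get_num_threads()  # size of the shared OpenBLAS pool

# The usual safe setting when parallelizing at the Julia level:
BLAS.set_num_threads(1)
```

Spelling this out next to the “one shared thread pool” sentence in the docs would make the consequence (serialized BLAS calls) much harder to miss.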