Until recently, Julia set two threading-related defaults for OpenBLAS. Both are now relaxed, since OpenBLAS has improved on both fronts:
- We used to build OpenBLAS with a maximum of 32 threads (at some point, a higher limit had a huge memory footprint)
- Julia would limit itself to no more than 8 OpenBLAS threads (to reduce Julia startup latency caused by OpenBLAS initialization)
For example, on a 64-core box you could set OPENBLAS_NUM_THREADS to get past the 8-thread startup default, but you would still run into the 32-thread compile-time maximum. The only solution was to build from source or replace your OpenBLAS binaries with the OpenBLASHighCoreCount binaries (if you knew about them and how to find them).
As of the latest master, we have changed these defaults to:
- Build with 4096 max threads
- Let OpenBLAS detect the number of threads on startup, which is usually Sys.CPU_THREADS
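If the detected default is not what you want for a particular workload, it can still be overridden at runtime. The following is a minimal sketch, assuming OpenBLAS is the active BLAS backend; the choice of 1 thread is only an illustration (for instance, when you parallelize at a higher level yourself).

```julia
using LinearAlgebra

# Thread count OpenBLAS picked at startup (auto-detected with the new defaults).
detected = BLAS.get_num_threads()
println("Detected BLAS threads: ", detected)

# Temporarily restrict BLAS to a single thread (illustrative value).
BLAS.set_num_threads(1)
println("After override: ", BLAS.get_num_threads())

# Restore the startup value so later code sees the default again.
BLAS.set_num_threads(detected)
println("Restored: ", BLAS.get_num_threads())
```

Setting OPENBLAS_NUM_THREADS in the environment before starting Julia remains the way to change the startup value itself.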
On a 20-core box (40 hyperthreads), peakflops(10000) now gives me 4e11 on Julia 1.8 master vs. 1.6e11 on Julia 1.6. This is out-of-the-box performance with no defaults tweaked.
Here’s how you can check the default number of threads OpenBLAS starts with:
                   _
       _       _ _(_)_     |  Documentation: https://docs.julialang.org
      (_)     | (_) (_)    |
       _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 1.8.0-DEV.697 (2021-10-10)
     _/ |\__'_|_|_|\__'_|  |  Commit 9a2e763269 (0 days old master)
    |__/                   |

    julia> Sys.CPU_THREADS
    40

    julia> using LinearAlgebra

    julia> BLAS.get_num_threads()
    40
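To see how the thread count affects peakflops on your own machine, something like the sketch below may help. It is not the exact benchmark from this post: the matrix size n = 2000 is an assumption chosen to keep the run short (the numbers above used 10000), and the results will vary with hardware and load.

```julia
using LinearAlgebra

n = 2000  # assumed small size for a quick run; the post used 10000
default_threads = BLAS.get_num_threads()

# Measure with 1 thread and with the startup default (deduplicated on 1-thread machines).
results = Dict{Int,Float64}()
for t in unique((1, default_threads))
    BLAS.set_num_threads(t)
    results[t] = LinearAlgebra.peakflops(n)
    println(t, " thread(s): ", results[t], " flop/s")
end

# Put the thread count back to what OpenBLAS started with.
BLAS.set_num_threads(default_threads)
```

If you report numbers in this thread, including Sys.CPU_THREADS and the Julia version alongside them makes comparisons easier.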
It would be great if people could try this out and file issues if they run into memory or startup latency problems. Please also share your experience with this change in this thread. I am sure we’ll find some performance regressions and such.