Ideal number of BLAS threads

Hi,

I am planning to run a heavy series of matrix operations on a cluster and I was wondering what’s the ideal number of BLAS threads I should use. Assume that I have a machine with 20 cores and 8-16 GB of RAM per core. Does it make sense to use more BLAS threads than cores? If so, is there any rule of thumb I should follow?

PS: I am migrating to Octavian.jl. How would things change if I used it instead of LinearAlgebra.jl?

1 Like

No.
Octavian (and MKL) should both use at most 1 thread per core.
Octavian runs on Julia’s threads, rather than being controlled by BLAS.set_num_threads().
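To make the distinction concrete, here is a minimal sketch (assuming Octavian.jl is installed) showing that the two thread pools are configured independently: OpenBLAS via BLAS.set_num_threads, Octavian via the number of Julia threads you started with julia -t:

```julia
using LinearAlgebra
using Octavian  # assumes Octavian.jl is in your environment

# OpenBLAS path: thread count controlled by BLAS.set_num_threads
BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2)  # e.g. one thread per physical core

A = rand(2000, 2000); B = rand(2000, 2000)
C1 = A * B                 # multiplied by OpenBLAS

# Octavian path: uses Julia's own threads (set with `julia -t N`),
# ignoring the BLAS thread setting entirely
C2 = Octavian.matmul(A, B)

C1 ≈ C2
```

The BLAS setting has no effect on Octavian, which is why starting Julia with `-t (number of physical cores)` is the relevant knob for it.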

5 Likes

Just use

julia -t (number of physical cores)

to start julia for Octavian.jl

2 Likes

I have a follow-up question. I have noticed that on a cluster with 1 vCPU per core and a total of 64 cores, BLAS.set_num_threads() sets a maximum of 32 threads (not 64). Could you clarify the reasoning behind this? lscpu shows 32 cores per socket – is that the cause? Also, is there any advantage to using a different ratio of vCPUs per core?

Yes, that is probably it.
Since it no longer uses Hwloc.jl, CPUSummary.jl doesn’t know the number of sockets.

He says that BLAS.set_num_threads is limited to a maximum of 32 threads, which shouldn’t be related to Hwloc.jl or CPUSummary.jl at all. Instead, Julia <= 1.7 simply has a built-in hard limit of 32 OpenBLAS threads. To work around it, you can either use a different BLAS, switch to Julia > 1.7, or change the limit manually and compile Julia yourself.
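A quick way to observe this cap (a sketch; the reported numbers depend on your Julia version and core count) is to ask for more threads than the limit and read back what OpenBLAS actually accepted:

```julia
using LinearAlgebra

# Request more threads than the OpenBLAS build allows.
# On Julia <= 1.7, OpenBLAS is compiled with a hard cap of 32 threads,
# so the request silently tops out there.
BLAS.set_num_threads(64)

# On a 64-core machine this reports 32 on Julia <= 1.7,
# and 64 on later versions (capped by the machine's core count).
BLAS.get_num_threads()
```

BLAS.get_num_threads is available from Julia 1.6 onward.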

Ha, sorry – I should have read more closely; I was still thinking about the earlier discussion mentioning Octavian.
Octavian’s threads would be limited by this.

Yes, for BLAS.set_num_threads/base Julia, use Julia > 1.7.
If on x86, you can also consider trying MKL.jl. MKL should perform well on Intel, and recent versions may at least be competitive with OpenBLAS on AMD.
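Switching to MKL is a one-liner (a sketch, assuming MKL.jl is installed and you are on x86): loading the package swaps the BLAS/LAPACK backend via libblastrampoline for the rest of the session.

```julia
using MKL            # assumes MKL.jl is installed; swaps the BLAS backend
using LinearAlgebra

# Inspect which backend libblastrampoline is now forwarding to;
# after `using MKL` the MKL library should appear in this list.
BLAS.get_config()

A = rand(500, 500); B = rand(500, 500)
C = A * B            # now computed by MKL
```

Note that `using MKL` should come early in the session, before you rely on BLAS-backed operations you want routed to MKL.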

1 Like

If you’re curious, you can also be (most likely) the first person besides me to try GitHub - carstenbauer/BLISBLAS.jl: a BLIS counterpart of MKL.jl. BLIS should run well and can outperform OpenBLAS in some cases, in particular on AMD.

Note though that BLIS only provides BLAS and no LAPACK (OpenBLAS will still be used for this).

1 Like

Neat.
I’ve tried BLIS_jll before, but it performed extremely poorly at small sizes, especially when multithreading. At the time, at least, they didn’t have an equivalent of OpenBLAS’s multithreaded gemm threshold, so BLIS was always threading, regardless of how small the matrices were.

Apart from the above suggestions, you may also take a look at the [BLAS performance testing for Julia 1.8] thread. I have not followed it recently, but as I understand it, there may be some significant changes and automation related to BLAS in the 1.8 release.

@carstenbauer Would BLISBLAS.jl work on Neoverse N1 (Ampere Altra)? I have heard that on this particular CPU, BLIS might be one of the most favorable options.

I can confirm that it seems to work without any problems on Neoverse-N1. As for performance, I was not able to do any in-depth testing; some preliminary results based on the code from the repository are below.
I recall discussing BLIS on ARM (Neoverse-N1) a few months ago, and at the time getting it to work looked challenging to me. Thank you for putting all this together.

lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Model name: Neoverse-N1
Stepping: r3p1
BogoMIPS: 50.00
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

./julia -t auto
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
OS: Linux (aarch64-unknown-linux-gnu)
CPU: unknown
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, neoverse-n1)

BLIS:

julia> using LinearAlgebra
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.so
julia> using BLISBLAS
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libopenblas64_.so
└ [ILP64] libblis.so
julia> using BenchmarkTools
julia> A = rand(1000,1000); B = rand(1000,1000);
julia> @btime $A * $B;
98.963 ms (2 allocations: 7.63 MiB)

OPENBLAS:

julia> using LinearAlgebra
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.so
julia> using BenchmarkTools
julia> A = rand(1000,1000); B = rand(1000,1000);
julia> @btime $A * $B;
29.436 ms (2 allocations: 7.63 MiB)