I’ve tried to run the following simple benchmark on an Oracle Cloud VM instance with 4 vCPUs (ARM Ampere A1) and 24 GB of RAM:
Threads.@threads for i in 1:10000
H = rand(100, 100) + im * rand(100, 100)
When I use julia -t 4, the running time is around 3 times longer than in the single-threaded test.
In both cases Julia correctly sees the value of Threads.nthreads().
Also, in both cases the CPU utilization is maximal (400%).
I wonder what is going on?
Julia version 1.7.2 (aarch64)
4 vCPU of ARM Ampere A1
If you run lscpu at the shell, is there more than one NUMA node?
You might also want to look at BLAS.set_num_threads to set the number of threads that OpenBLAS is trying to use:
I am experimenting with Julia’s (experimental) multithreading feature recently and like the results so far. One of the problems I need to deal with, involves the multiplication of several pairs (say in the order 5 to 50) of matrices, whose size is average (say linear size in the order 10 - 1000). For that problem, there can be a competition between either using Julia threads (to loop over the different pairs) versus using multithreaded matrix multiplication provided by BLAS, and it will depend …
It’s only 1 NUMA node
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
NUMA node(s): 1
Vendor ID: ARM
Model name: Neoverse-N1
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
BLAS.get_num_threads() returns 4 in both cases
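The two thread counts being compared here are configured independently, so it can help to print them side by side. The following is a minimal sketch, assuming the default OpenBLAS backend; the helper name `thread_settings` is hypothetical and only used for illustration.

```julia
using LinearAlgebra

# Hypothetical helper: collect the thread-related settings discussed in this thread.
function thread_settings()
    return (
        julia_threads = Threads.nthreads(),      # fixed at startup by `julia -t N`
        blas_threads  = BLAS.get_num_threads(),  # threads used by the BLAS backend per call
        cpu_threads   = Sys.CPU_THREADS,         # logical CPUs visible to Julia
    )
end

settings = thread_settings()
println("Julia threads: ", settings.julia_threads)
println("BLAS threads:  ", settings.blas_threads)
println("CPU threads:   ", settings.cpu_threads)
```

If both the Julia and BLAS counts equal the number of cores, the two layers may end up competing for the same CPUs when BLAS is called from inside a threaded loop.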
Try setting it to 1 via BLAS.set_num_threads(1).
With BLAS.set_num_threads(1) and 4 threads, the code runs 3x faster than single-threaded.
Though, the single-threaded case is still strange.
With BLAS.set_num_threads(1), the CPU utilization becomes 100% in this case, but the running time remains the same!
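A small timing sketch can make this comparison repeatable. The loop body of the original benchmark is not shown in the thread, so `eigvals(H)` below is an assumed stand-in for a LAPACK-backed operation, and the function names `run_batch` and `time_with_blas_threads` are hypothetical. The sketch also assumes the default OpenBLAS backend, where `BLAS.get_num_threads()` returns an integer.

```julia
using LinearAlgebra

# Assumed workload: the original loop body is not shown, so eigvals is a stand-in.
function run_batch(n)
    Threads.@threads for i in 1:n
        H = rand(100, 100) + im * rand(100, 100)
        eigvals(H)
    end
    return nothing
end

# Time run_batch(n) with a given number of BLAS threads, then restore the old setting.
function time_with_blas_threads(nblas, n)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nblas)
    try
        run_batch(2)                  # warm-up so compilation is not timed
        return @elapsed run_batch(n)
    finally
        BLAS.set_num_threads(old)
    end
end

default_threads = BLAS.get_num_threads()
for nblas in unique((1, default_threads))
    t = time_with_blas_threads(nblas, 200)
    println("Julia threads = ", Threads.nthreads(),
            ", BLAS threads = ", nblas, ": ", round(t; digits=3), " s")
end
```

Running the script once with julia -t 1 and once with julia -t 4 gives the four combinations discussed in this thread.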
How are you measuring CPU utilization?
That is what the top command shows.
htop may give you a better idea on what is happening on a per processor basis.
Julia 1.7 has OPENBLAS_VER := 0.3.13
And check the OpenBLAS changelog (search for “neoverse”):
“0.3.16: fixed missing restore of a register … Neoverse N1 that could cause spurious failures in e.g. DGEEV”
“0.3.14: Fixed the THUNDERX2T99 and NEOVERSEN1 DNRM2/ZNRM2 kernels for inputs with Inf”
I’ve tried the beta version and recompiling, but the results are the same.
htop shows when the code runs with the default number of BLAS threads (4).
BLAS.set_num_threads(1) has a similar effect on x86 systems: it makes the code run a little faster and with much lower CPU utilization. But -t 4 behaves consistently there. The code runs about 3x faster than with -t 1 for the same number of BLAS threads.
The neoverse-n1 system running -t 4 with BLAS.set_num_threads(1) is more than 10 times faster than the default.