Dandan
April 14, 2022, 2:01pm
1
I’ve tried to run the following simple benchmark on an Oracle Cloud VM instance with 4 vCPUs (ARM Ampere A1) and 24 GB of RAM:
using LinearAlgebra
function foo()
    Threads.@threads for i in 1:10000
        H = rand(100, 100) + im * rand(100, 100)
        eigen(Hermitian(H))
    end
end
@time foo()
When I start Julia with julia -t 4, the running time is around 3 times longer than in the single-threaded test.
In both cases Julia correctly sees the value of Threads.nthreads().
Also, in both cases CPU utilization is maximal (400%).
I wonder what is going on?
Julia version 1.7.2 (aarch64)
mkitti
April 14, 2022, 2:34pm
2
Dandan:
4 vCPU of ARM Ampere A1
If you run lscpu at the shell, is there more than one NUMA node?
mkitti
April 14, 2022, 2:36pm
3
Dandan:
eigen
You might also want to look at BLAS.set_num_threads to set the number of threads that OpenBLAS tries to use:
I am experimenting with Julia’s (experimental) multithreading feature recently and like the results so far. One of the problems I need to deal with, involves the multiplication of several pairs (say in the order 5 to 50) of matrices, whose size is average (say linear size in the order 10 - 1000). For that problem, there can be a competition between either using Julia threads (to loop over the different pairs) versus using multithreaded matrix multiplication provided by BLAS, and it will depend …
Dandan
April 14, 2022, 2:57pm
4
It’s only 1 NUMA node
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Model name: Neoverse-N1
Stepping: r3p1
BogoMIPS: 50.00
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
BLAS.get_num_threads() returns 4 in both cases.
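For anyone reproducing this, here is a minimal sketch of how both thread pools can be inspected from inside one session. The helper name thread_report is hypothetical; Threads.nthreads() and BLAS.get_num_threads() are the two calls already mentioned in this thread.

```julia
using LinearAlgebra

# Hypothetical helper: report Julia's thread count and the BLAS library's
# thread count side by side, since the two pools are configured independently.
function thread_report()
    return (julia_threads = Threads.nthreads(),
            blas_threads = BLAS.get_num_threads())
end

report = thread_report()
println("Julia threads: ", report.julia_threads)
println("BLAS threads:  ", report.blas_threads)
```

If both numbers equal the core count, a threaded loop that calls into BLAS/LAPACK can end up requesting more threads than there are cores.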
mkitti
April 14, 2022, 2:59pm
5
Try setting it to 1 via BLAS.set_num_threads(1)
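A minimal sketch of that suggestion applied to the benchmark from the first post. The parameter n and the reduced iteration count are assumptions added here so the example finishes quickly; the original loop used 10000 iterations.

```julia
using LinearAlgebra

# Keep the BLAS library single-threaded so that its threads do not compete
# with the Julia threads started via `julia -t 4`.
BLAS.set_num_threads(1)

# Same loop as in the original post; `n` is a hypothetical parameter.
function foo(n)
    Threads.@threads for i in 1:n
        H = rand(100, 100) + im * rand(100, 100)
        eigen(Hermitian(H))
    end
end

@time foo(1_000)
```

With this setting the parallelism comes only from Threads.@threads, so each eigen call runs on the thread that invoked it.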
Dandan
April 14, 2022, 3:26pm
6
Right, with BLAS.set_num_threads(1) and 4 threads the code runs 3x faster than single-threaded.
The single-threaded case is still strange, though.
With BLAS.set_num_threads(1) the CPU utilization drops to 100% in this case, but the running time remains the same!
mkitti
April 14, 2022, 3:31pm
7
How are you measuring CPU utilization?
Dandan
April 14, 2022, 3:33pm
8
It’s what the top command shows.
mkitti
April 14, 2022, 3:52pm
9
htop may give you a better idea of what is happening on a per-processor basis.
Neoverse-N1
Julia 1.7 has OPENBLAS_VER := 0.3.13
Also check the OpenBLAS changelog (search for “neoverse”):
“0.3.16: fixed missing restore of a register … Neoverse N1 that could cause spurious failures in e.g. DGEEV”
“0.3.14: Fixed the THUNDERX2T99 and NEOVERSEN1 DNRM2/ZNRM2 kernels for inputs with Inf”
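To see which CPU target Julia detected and which BLAS backend is actually loaded, something like the following sketch may help. It assumes Julia 1.7 or newer, where BLAS.get_config() is available; note that it shows the loaded library rather than the OpenBLAS version number itself.

```julia
using LinearAlgebra

# Print the Julia version, the architecture and the CPU name Julia detected.
println("Julia version: ", VERSION)
println("Architecture:  ", Sys.ARCH)
println("CPU name:      ", Sys.CPU_NAME)

# Show which BLAS/LAPACK library is currently loaded (Julia 1.7+).
println("BLAS config:   ", BLAS.get_config())
```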
suggestions:
Dandan
April 14, 2022, 6:56pm
11
I’ve tried the beta version and recompilation, but the results are the same.
This is what htop shows when the code runs with the default number of BLAS threads (4).
Update
Actually, BLAS.set_num_threads(1) behaves similarly on x86 systems: it makes the code run a little faster and with much lower CPU utilization. But -t 4 behaves consistently there: the code runs about 3x faster than with -t 1 for the same number of BLAS threads.
On this neoverse-n1 system, running -t 4 with BLAS.set_num_threads(1) is more than 10x faster than with the default BLAS.set_num_threads(4).
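A rough sketch of how such a comparison can be scripted in one session. The helper time_with_blas_threads, the parameter n and the iteration counts are hypothetical choices made here; actual timings will depend on the machine and on how many threads Julia was started with.

```julia
using LinearAlgebra

# Same workload as in the original post; `n` is a hypothetical parameter.
function foo(n)
    Threads.@threads for i in 1:n
        H = rand(100, 100) + im * rand(100, 100)
        eigen(Hermitian(H))
    end
end

# Hypothetical helper: time foo(n) with a given number of BLAS threads,
# restoring the previous BLAS setting afterwards.
function time_with_blas_threads(nblas, n)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nblas)
    try
        foo(2)                  # warm-up so compilation is not timed
        return @elapsed foo(n)
    finally
        BLAS.set_num_threads(old)
    end
end

for nblas in (1, 4)
    t = time_with_blas_threads(nblas, 500)
    println("Julia threads = ", Threads.nthreads(),
            ", BLAS threads = ", nblas, ": ", round(t; digits = 2), " s")
end
```

Running the script once with julia -t 1 and once with julia -t 4 covers the four combinations discussed above.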