Dandan
April 14, 2022, 2:01pm
#1
I tried to run the following simple benchmark on an Oracle Cloud VM instance with 4 ARM Ampere A1 vCPUs and 24 GB of RAM:
using LinearAlgebra

function foo()
    Threads.@threads for i in 1:10000
        H = rand(100, 100) + im * rand(100, 100)
        eigen(Hermitian(H))
    end
end

@time foo()
When I use julia -t 4, the running time is around 3 times longer than in the single-threaded test.
In both cases Julia correctly sees the value of Threads.nthreads().
Also, in both cases the CPU utilization is maximal (400%).
I wonder what is going on?
Julia version 1.7.2 (aarch64)
mkitti
April 14, 2022, 2:34pm
#2
Dandan:
4 vCPU of ARM Ampere A1
If you run lscpu at the shell, is there more than one NUMA node?
mkitti
April 14, 2022, 2:36pm
#3
Dandan:
eigen
You might also want to look at BLAS.set_num_threads to set the number of threads that OpenBLAS is trying to use:
I am experimenting with Julia’s (experimental) multithreading feature recently and like the results so far. One of the problems I need to deal with, involves the multiplication of several pairs (say in the order 5 to 50) of matrices, whose size is average (say linear size in the order 10 - 1000). For that problem, there can be a competition between either using Julia threads (to loop over the different pairs) versus using multithreaded matrix multiplication provided by BLAS, and it will depend …
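For reference, the two thread pools mentioned here can be inspected separately from the REPL. This is a minimal sketch, assuming Julia 1.6 or later (where BLAS.get_num_threads is available); the variable names are only illustrative.

```julia
using LinearAlgebra

# Julia-level threads, fixed at startup by `julia -t N` or JULIA_NUM_THREADS
julia_threads = Threads.nthreads()

# Threads used internally by the BLAS/LAPACK backend (OpenBLAS by default)
blas_threads = BLAS.get_num_threads()

println("Julia threads: ", julia_threads)
println("BLAS threads:  ", blas_threads)
```

If both numbers are greater than 1, a threaded loop that calls into LAPACK may end up running more worker threads than there are cores.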
Dandan
April 14, 2022, 2:57pm
#4
It’s only 1 NUMA node
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Model name: Neoverse-N1
Stepping: r3p1
BogoMIPS: 50.00
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
BLAS.get_num_threads() returns 4 in both cases
mkitti
April 14, 2022, 2:59pm
#5
Try setting it to 1 via BLAS.set_num_threads(1)
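Applied to the benchmark from the first post, that could look roughly like the sketch below. The loop body is unchanged; the iteration-count argument n and the warm-up call are additions made here for illustration (the original uses a fixed 10000 iterations and no warm-up).

```julia
using LinearAlgebra

# Make each LAPACK call single-threaded, so the only parallelism
# comes from the Julia threads running the loop.
BLAS.set_num_threads(1)

# Same loop body as the original benchmark; `n` is an illustrative parameter.
function foo(n)
    Threads.@threads for i in 1:n
        H = rand(100, 100) + im * rand(100, 100)
        eigen(Hermitian(H))
    end
end

foo(10)          # warm-up call so that compilation is not included in the timing
@time foo(1_000) # the original post uses 10000 iterations
```

Start Julia with julia -t 4 to run this with four Julia threads.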
Dandan
April 14, 2022, 3:26pm
#6
Right, with BLAS.set_num_threads(1) and 4 threads the code runs 3x faster than single-threaded.
The single-threaded case is still strange, though: with BLAS.set_num_threads(1) the CPU utilization drops to 100%, but the running time stays the same!
mkitti
April 14, 2022, 3:31pm
#7
How are you measuring CPU utilization?
Dandan
April 14, 2022, 3:33pm
#8
It’s what the top command shows
mkitti
April 14, 2022, 3:52pm
#9
htop may give you a better idea of what is happening on a per-processor basis.
Neoverse-N1
Julia 1.7 - has OPENBLAS_VER := 0.3.13
And check the OpenBLAS changelog ( search “neoverse” )
“3.16 : fixed missing restore of a register … Neoverse N1 that could cause spurious failures in e.g. DGEEV”
“3.14 : Fixed the THUNDERX2T99 and NEOVERSEN1 DNRM2/ZNRM2 kernels for inputs with Inf”
suggestions:
Dandan
April 14, 2022, 6:56pm
#11
I’ve tried the beta version and recompilation, but the results are the same.
It’s what htop shows when the code runs with the default number of BLAS threads (4).
Update
Actually, BLAS.set_num_threads(1) behaves similarly on x86 systems: it makes the code run a little faster and with much lower CPU utilization. But -t 4 behaves consistently there: the code runs about 3x faster than with -t 1 for the same number of BLAS threads.
On this neoverse-n1 system, running -t 4 with BLAS.set_num_threads(1) is more than 10x faster than with the default BLAS.set_num_threads(4).
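A comparison like the one described in this update can be scripted in a single session. The following is only a sketch: time_with_blas_threads is a hypothetical helper, foo takes an illustrative iteration count n, and the workload is smaller than in the original post to keep the run short.

```julia
using LinearAlgebra

# Same loop body as the original benchmark; `n` is an illustrative parameter.
function foo(n)
    Threads.@threads for i in 1:n
        H = rand(100, 100) + im * rand(100, 100)
        eigen(Hermitian(H))
    end
end

# Hypothetical helper: time foo(n) with a given number of BLAS threads,
# then restore the previous BLAS setting.
function time_with_blas_threads(nblas, n)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nblas)
    try
        foo(10)                # warm-up, excluded from the measurement
        return @elapsed foo(n)
    finally
        BLAS.set_num_threads(old)
    end
end

default_blas = BLAS.get_num_threads()

# Compare single-threaded BLAS against the default BLAS thread count.
for nblas in (1, default_blas)
    t = time_with_blas_threads(nblas, 1_000)
    println("Julia threads = ", Threads.nthreads(),
            ", BLAS threads = ", nblas,
            ": ", round(t; digits = 2), " s")
end
```

Running the script once with julia -t 1 and once with julia -t 4 covers the four combinations discussed in this thread.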