Strange multithreaded behaviour on ARM Ampere A1 (Oracle Cloud)

Dandan · April 14, 2022, 2:01pm

I’ve tried to run the following simple benchmark on an Oracle Cloud VM instance with 4 vCPUs (ARM Ampere A1) and 24 GB of RAM:

using LinearAlgebra

function foo()
	Threads.@threads for i in 1:10000
		H = rand(100, 100) + im * rand(100, 100)
		eigen(Hermitian(H))
	end
end

@time foo()

When I use

julia -t 4

the running time is about 3 times longer than in the single-threaded test (julia -t 1).

In both cases Julia correctly sees the value of Threads.nthreads().

Also, in both cases the CPU utilization is maximal (400%).

I wonder what is going on?

Julia version 1.7.2 (aarch64)

mkitti · April 14, 2022, 2:34pm

If you run lscpu at the shell, is there more than one NUMA node?

mkitti · April 14, 2022, 2:36pm

You might also want to look at BLAS.set_num_threads to set the number of threads that OpenBLAS is trying to use.
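As a minimal sketch of what this suggestion looks like in practice (assuming Julia 1.6 or later, where BLAS.get_num_threads is available), the two thread counts can be inspected side by side and the BLAS one lowered:

```julia
using LinearAlgebra

# Julia threads (set with `julia -t N`) and BLAS threads are separate settings.
julia_threads = Threads.nthreads()
blas_threads = BLAS.get_num_threads()
println("Julia threads: ", julia_threads, ", BLAS threads: ", blas_threads)

# Restrict the BLAS library to one thread so that it does not compete
# with the Julia threads for the same cores.
BLAS.set_num_threads(1)
println("BLAS threads now: ", BLAS.get_num_threads())
```

With this setting, any parallelism comes only from Threads.@threads in the benchmark loop.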

Dandan · April 14, 2022, 2:57pm

It’s only 1 NUMA node

$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        50.00
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

BLAS.get_num_threads() returns 4 in both cases

mkitti · April 14, 2022, 2:59pm

Try setting it to 1 via BLAS.set_num_threads(1)

Dandan · April 14, 2022, 3:26pm

Right, with BLAS.set_num_threads(1) and 4 threads the code runs 3x faster than single-threaded.

Though, the single threaded case is still strange.
With BLAS.set_num_threads(1), the CPU utilization drops to 100% in this case, but the running time remains the same!

mkitti · April 14, 2022, 3:31pm

How are you measuring CPU utilization?

Dandan · April 14, 2022, 3:33pm

It’s what the top command shows.

mkitti · April 14, 2022, 3:52pm

htop may give you a better idea of what is happening on a per-processor basis.

ImreSamu · April 14, 2022, 3:57pm

Neoverse-N1

Julia 1.7 - has OPENBLAS_VER := 0.3.13

And check the OpenBLAS changelog ( search “neoverse” )

“3.16 : fixed missing restore of a register … Neoverse N1 that could cause spurious failures in e.g. DGEEV”
“3.14 : Fixed the THUNDERX2T99 and NEOVERSEN1 DNRM2/ZNRM2 kernels for inputs with Inf”

suggestions:

check Julia v1.8.0-beta3 (March 29, 2022)
- it has “OPENBLAS_VER := 0.3.17”
if it is not enough - recompile Julia 1.8 source with -march=armv8.2-a -mtune=neoverse-n1
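Before and after switching Julia versions, it may help to confirm which BLAS library the running session actually loads. A minimal sketch, assuming Julia 1.7 or later (where BLAS.get_config() is available); note that it prints the library file names, not the OpenBLAS version number itself:

```julia
using LinearAlgebra

println("Julia version: ", VERSION)
println("CPU architecture: ", Sys.ARCH)

# BLAS.get_config() reports the backing libraries loaded through libblastrampoline.
libs = BLAS.get_config().loaded_libs
for lib in libs
    println("BLAS/LAPACK library: ", lib.libname)
end
```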

Dandan · April 14, 2022, 6:56pm

I’ve tried the beta version and recompilation, but the results are the same.

It’s what htop shows when the code runs with the default number of BLAS threads (4).

Update

Actually, BLAS.set_num_threads(1) behaves similarly on x86 systems: it makes the code run a little faster and with much lower CPU utilization. But -t 4 behaves consistently there: the code runs about 3x faster than with -t 1 for the same number of BLAS threads.

On this neoverse-n1 system, running -t 4 with BLAS.set_num_threads(1) is more than 10x faster than with the default BLAS.set_num_threads(4).
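A comparison like this can be reproduced with a small timing loop. The following is only a sketch: workload and time_with_blas_threads are hypothetical helper names, and the iteration count is reduced from the original 10000 so the check runs quickly. Start Julia with -t 4 to match the setup above.

```julia
using LinearAlgebra

# Same per-iteration work as the original benchmark, with the loop count as a parameter.
function workload(n)
    Threads.@threads for i in 1:n
        H = rand(100, 100) + im * rand(100, 100)
        eigen(Hermitian(H))
    end
end

# Hypothetical helper: time the workload with a given number of BLAS threads,
# restoring the previous BLAS setting afterwards.
function time_with_blas_threads(nblas, n)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nblas)
    try
        workload(2)                  # warm-up so compilation is not timed
        return @elapsed workload(n)
    finally
        BLAS.set_num_threads(old)
    end
end

for nblas in (1, 4)
    t = time_with_blas_threads(nblas, 500)
    println("BLAS threads = ", nblas, ": ", round(t, digits = 3), " s")
end
```

Running the same script with -t 1 and -t 4 gives the four combinations discussed in this thread.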
