Strange multithreaded behaviour on ARM Ampere A1 (Oracle Cloud)

I’ve tried to run the following simple benchmark in an Oracle Cloud VM, on an instance with 4 vCPUs of ARM Ampere A1 and 24 GB of RAM:

using LinearAlgebra

function foo()
	Threads.@threads for i in 1:10000
		H = rand(100, 100) + im * rand(100, 100)
		eigen(Hermitian(H))
	end
end

@time foo()

When I use

julia -t 4

the running time is around 3 times longer than in the single-threaded test.

In both cases Julia correctly sees the value of Threads.nthreads().

Also, in both cases the CPU utilization is maximal (400%).

I wonder what is going on?

Julia version 1.7.2 (aarch64)

If you run lscpu at the shell, is there more than one NUMA node?

You might also want to look at BLAS.set_num_threads to set the number of threads that OpenBLAS is trying to use:

It’s only 1 NUMA node

$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        50.00
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dc
                                 pop asimddp ssbs

BLAS.get_num_threads() returns 4 in both cases
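For reference, here is a minimal sketch of how the two thread counts can be printed side by side. The Julia thread count (set with -t) and the OpenBLAS thread count are independent settings, so with -t 4 and 4 BLAS threads the two pools may end up competing for the same 4 cores.

```julia
using LinearAlgebra

# Julia-level threads, controlled by `julia -t N` or JULIA_NUM_THREADS.
println("Julia threads: ", Threads.nthreads())

# OpenBLAS threads, controlled by BLAS.set_num_threads or OPENBLAS_NUM_THREADS.
println("BLAS threads:  ", BLAS.get_num_threads())

# Logical CPUs visible to Julia, for comparison with the two numbers above.
println("CPU threads:   ", Sys.CPU_THREADS)
```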

Try setting it to 1 via BLAS.set_num_threads(1)
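A minimal sketch of the benchmark with that change applied, assuming the session is started with julia -t 4. The warm-up call and the optional argument n are additions for illustration and are not part of the original benchmark.

```julia
using LinearAlgebra

# Keep OpenBLAS single-threaded so that only the Julia threads
# started with `julia -t 4` compete for the cores.
BLAS.set_num_threads(1)

function foo(n = 10_000)
    Threads.@threads for i in 1:n
        H = rand(100, 100) + im * rand(100, 100)
        eigen(Hermitian(H))
    end
end

foo(10)      # warm-up so that compilation time is not measured
@time foo()
```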


Right, with BLAS.set_num_threads(1) and 4 threads the code runs 3x faster than single-threaded.

Though, the single-threaded case is still strange.
With BLAS.set_num_threads(1) the CPU utilization drops to 100% in this case, but the running time remains the same!

How are you measuring CPU utilization?

It’s what the top command shows

htop may give you a better idea of what is happening on a per-processor basis.

Neoverse-N1

Julia 1.7 - has OPENBLAS_VER := 0.3.13

And check the OpenBLAS changelog (search for “neoverse”):

  • 0.3.16: “fixed missing restore of a register … Neoverse N1 that could cause spurious failures in e.g. DGEEV”
  • 0.3.14: “Fixed the THUNDERX2T99 and NEOVERSEN1 DNRM2/ZNRM2 kernels for inputs with Inf”

suggestions:


I’ve tried the beta version and recompilation, but the results are the same.

It’s what htop shows when the code runs with the default number of BLAS threads (4).

Update

Actually, BLAS.set_num_threads(1) behaves similarly on x86 systems: it makes the code run a little faster and with much lower CPU utilization. But -t 4 behaves consistently there: the code runs about 3x faster than with -t 1 for the same number of BLAS threads.

On this Neoverse-N1 system, running -t 4 with BLAS.set_num_threads(1) is more than 10x faster than with the default of 4 BLAS threads.
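For anyone who wants to reproduce the comparison in a single session, here is a rough sketch. The helper time_with_blas_threads and the reduced iteration count are made up for illustration; the timings will depend on the machine and on the -t value used to start Julia.

```julia
using LinearAlgebra

function foo(n)
    Threads.@threads for i in 1:n
        H = rand(100, 100) + im * rand(100, 100)
        eigen(Hermitian(H))
    end
end

# Hypothetical helper: time the same workload under a given BLAS thread
# count, then restore the previous setting.
function time_with_blas_threads(nblas, n = 1000)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nblas)
    try
        foo(10)                 # warm-up so compilation is not timed
        return @elapsed foo(n)
    finally
        BLAS.set_num_threads(old)
    end
end

for nblas in (1, 4)
    t = time_with_blas_threads(nblas)
    println("Julia threads = ", Threads.nthreads(),
            ", BLAS threads = ", nblas,
            ": ", round(t, digits = 2), " s")
end
```

Running this once with julia -t 1 and once with julia -t 4 covers the four combinations discussed above.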
