Not sure where the best place to post this is: I noticed that with Julia (x86) running under Rosetta translation, code that heavily uses linear algebra can be 7-10 times slower with the default OpenBLAS thread count (8) than with a single thread. It's hard to produce a minimal example, but a somewhat simple one is listed at the end of this post. Tested on Julia 1.7.0-rc3. My guess is that the process may run on the slow efficiency cores along with the fast performance cores, causing a lot of waiting/locking:
Looking into this further, it appears that BLAS sets its thread count to 8 by default. Setting it to 1 makes the code considerably faster:
Default (`BLAS.get_num_threads()` returns 8):
1.024 s (11007 allocations: 6.04 MiB)
whereas with the BLAS thread count set to 1 via `BLAS.set_num_threads(1)`, the code runs in
170.692 ms (11007 allocations: 6.04 MiB)
As a comparison, the native ARM (M1) 1.7 build gives
55.083 ms (11007 allocations: 6.04 MiB)
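For reference, the default-vs-single-thread gap can be checked with just `LinearAlgebra` and `BenchmarkTools`, without `ExponentialUtilities`. This is only a minimal sketch (the 17×17 size mirrors the example at the end of the post), not the full benchmark:

```julia
using BenchmarkTools, LinearAlgebra

# Remember the current OpenBLAS thread count so it can be restored.
saved = BLAS.get_num_threads()

a = rand(ComplexF64, 17, 17)
b = similar(a)

# Time a small complex matrix multiply with the default thread count...
t_default = @belapsed mul!($b, $a, $a)

# ...then again with a single BLAS thread.
BLAS.set_num_threads(1)
t_single = @belapsed mul!($b, $a, $a)

# Restore the original setting.
BLAS.set_num_threads(saved)

println("default ($saved threads): $(t_default * 1e6) μs")
println("single thread:            $(t_single * 1e6) μs")
```

At this matrix size the per-call work is tiny, so any cross-core synchronization overhead dominates the actual multiply.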
It looks like under the default 8 threads, two of them run on the slower efficiency cores (maybe causing the others to wait?), whereas in the single-threaded OpenBLAS case the peak is not very discernible.
And in VS Code, `@profview` shows that the majority of the time is spent in wait/lock (the tiny sliver at the rightmost is the actual linear algebra calculation).
Example:
```julia
using BenchmarkTools, LinearAlgebra
using ExponentialUtilities

function loop_ex(n, m)
    c = zeros(ComplexF64, n, n)
    cache = ExponentialUtilities.alloc_mem(c, ExpMethodHigham2005())
    for i = 1:m
        a = rand(ComplexF64, n, n)
        b = exponential!(a, ExpMethodHigham2005(a), cache)
        mul!(c, b, b, 1, 1)
    end
    c
end

@btime loop_ex(17, 1000)
```
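To see where the slowdown kicks in, one could sweep over BLAS thread counts and re-time the example. This is a hypothetical sketch (`sweep_blas_threads` is my own helper, and it assumes `loop_ex` from above is already defined):

```julia
using BenchmarkTools, LinearAlgebra

# Re-run the benchmark above at several BLAS thread counts,
# restoring the original setting afterwards.
function sweep_blas_threads(n, m; counts = (1, 2, 4, 8))
    saved = BLAS.get_num_threads()
    for t in counts
        BLAS.set_num_threads(t)
        elapsed = @belapsed loop_ex($n, $m)
        println("BLAS threads = $t: $(round(elapsed * 1000; digits = 1)) ms")
    end
    BLAS.set_num_threads(saved)
end

sweep_blas_threads(17, 100)
```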
Is there a way to force Julia to run only on the performance cores by setting the correct QoS? Should this be reported on GitHub? Thanks!
–
edit: might be relevant: pr