Julia under Rosetta 2 on Mac M1: threading/scheduling issues with OpenBLAS?

I’m not sure where the best place to post this is. I noticed that with Julia (x86) running under Rosetta translation, code that makes heavy use of linear algebra can be 7-10 times slower with the default (8) OpenBLAS threads than with single-threaded OpenBLAS. It’s hard to get a minimal example, but a somewhat simple example is listed at the end of this post. Tested on Julia 1.7.0-rc3. My guess is that the process runs on the slow efficiency cores along with the fast performance cores, causing a lot of waiting/locking.

Looking into this further, BLAS sets the number of threads to 8 by default. With the thread count set to 1, the code is considerably faster:

Default (BLAS.get_num_threads() returns 8)
1.024 s (11007 allocations: 6.04 MiB)

whereas after setting the BLAS thread count to 1 with BLAS.set_num_threads(1), the code runs in
170.692 ms (11007 allocations: 6.04 MiB)

For comparison, the native ARM (M1) build of 1.7 gives
55.083 ms (11007 allocations: 6.04 MiB)
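
For anyone who wants to check the thread-count effect without installing ExponentialUtilities, here is a minimal sketch using only the LinearAlgebra standard library. The helper name `with_blas_threads` is hypothetical (it is not part of LinearAlgebra); it only wraps the `BLAS.get_num_threads` / `BLAS.set_num_threads` calls used above so the previous setting is restored afterwards. The workload is a plain matrix product rather than the matrix exponential, so the timings will not match the numbers above.

```julia
using LinearAlgebra

# Hypothetical helper (not part of LinearAlgebra): run `f` with a temporary
# BLAS thread count, then restore the previous setting.
function with_blas_threads(f, n::Integer)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(n)
    try
        return f()
    finally
        BLAS.set_num_threads(old)
    end
end

a = rand(ComplexF64, 17, 17)   # same small matrix size as in the example
a * a                          # warm-up so compilation is not timed

# Time 1000 products with the current (default) BLAS thread count ...
t_default = @elapsed begin
    for _ in 1:1000
        a * a
    end
end

# ... and again with BLAS restricted to a single thread.
t_single = with_blas_threads(1) do
    @elapsed begin
        for _ in 1:1000
            a * a
        end
    end
end

println("default threads: ", t_default, " s, single thread: ", t_single, " s")
```

If the single-threaded timing is clearly lower on your machine too, that would suggest the slowdown comes from threading overhead on small matrices rather than from the exponential routine itself.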

It looks like with the default 8 threads, 2 of them run on the slower efficiency cores (maybe that is what causes the waiting?)

whereas in the single-threaded OpenBLAS case, the peak is not very discernible.

And in VS Code, @profview shows that the majority of the time is spent on wait/lock (the tiny sliver at the far right is the actual linear algebra computation).

Example:

using BenchmarkTools, LinearAlgebra
using ExponentialUtilities


function loop_ex(n, m)
    c = zeros(ComplexF64, n, n)
    cache = ExponentialUtilities.alloc_mem(c, ExpMethodHigham2005())
    for i = 1:m
        a = rand(ComplexF64, n, n)
        b = exponential!(a, ExpMethodHigham2005(a), cache)
        mul!(c, b, b, 1, 1)
    end
    c
end

@btime loop_ex(17, 1000)

Is there a way to force Julia to run only on the performance cores by setting the correct QoS? Should this be reported on GitHub? Thanks!


edit: might be relevant: pr


I set these on the base M1 and don’t have issues; I imagine for the higher-performance chips you just change the number. The scheduler then mostly uses the high-performance cores.

export OMP_NUM_THREADS=4
export JULIA_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
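
The exports above have to be in the shell environment before Julia starts. A rough runtime equivalent for the BLAS part, from inside a session, might look like the sketch below. It assumes that the `sysctl` key `hw.perflevel0.physicalcpu` reports the number of performance cores (this is my understanding for recent macOS on Apple Silicon, but treat it as an assumption); when the key is unavailable it falls back to 4, the value used above for the base M1. The function names `performance_core_count` and `cap_blas_threads` are made up for illustration.

```julia
using LinearAlgebra

# Hypothetical helper: ask macOS for the number of performance cores.
# Returns `nothing` on other platforms or if the sysctl key is missing.
function performance_core_count()
    Sys.isapple() || return nothing
    try
        cmd = pipeline(`sysctl -n hw.perflevel0.physicalcpu`; stderr = devnull)
        return tryparse(Int, strip(read(cmd, String)))
    catch
        return nothing   # sysctl failed or the key does not exist
    end
end

# Hypothetical helper: cap the BLAS thread count at the performance-core
# count, falling back to `default` (4, as on the base M1) when unknown.
function cap_blas_threads(default::Integer = 4)
    n = something(performance_core_count(), default)
    BLAS.set_num_threads(n)
    return BLAS.get_num_threads()   # report what BLAS actually accepted
end

println("BLAS threads now: ", cap_blas_threads())
```

This only limits how many threads OpenBLAS spawns; it does not pin them to particular cores, so which cores get used is still up to the macOS scheduler.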

I’m not sure this is a platform-specific problem.

versioninfo()
println(BLAS.get_config())
println(BLAS.get_num_threads())
@btime loop_ex(17, 1000)
BLAS.set_num_threads(1)
println(BLAS.get_num_threads())
@btime loop_ex(17, 1000)

yields

Julia Version 1.7.0-rc3
Commit 3348de4ea6 (2021-11-15 08:22 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 6
LBTConfig([ILP64] libopenblas64_.dll)
8
  457.824 ms (11007 allocations: 6.04 MiB)
1
  36.553 ms (11007 allocations: 6.04 MiB)

and

Julia Version 1.7.0-rc3
Commit 3348de4ea6 (2021-11-15 08:22 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 6
LBTConfig([ILP64] mkl_rt.1.dll)
1
  36.639 ms (11007 allocations: 6.04 MiB)
1
  36.577 ms (11007 allocations: 6.04 MiB)

here.
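
To see whether the slowdown depends on matrix size rather than on the platform, a small sweep like the one below could help. This is only a sketch: `time_matmul` is a hypothetical helper, the sizes 17 and 300 and the repetition count are arbitrary choices, and it times a plain `mul!` instead of the full `loop_ex` from the original post.

```julia
using LinearAlgebra

# Hypothetical helper: time `reps` in-place products of n×n complex matrices
# using the given BLAS thread count, restoring the old count afterwards.
function time_matmul(n::Integer, nthreads::Integer; reps::Integer = 50)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(nthreads)
    a = rand(ComplexF64, n, n)
    c = similar(a)
    try
        mul!(c, a, a)   # warm-up so compilation is not timed
        return @elapsed begin
            for _ in 1:reps
                mul!(c, a, a)
            end
        end
    finally
        BLAS.set_num_threads(old)
    end
end

# Compare a tiny matrix (as in the thread) with a larger one.
for n in (17, 300), t in (1, 8)
    ms = round(time_matmul(n, t) * 1e3; digits = 2)
    println("n = $n, BLAS threads = $t: $ms ms")
end
```

If 8 threads only hurt at n = 17 and help (or at least do no harm) at n = 300, that would point to threading overhead on small problems as the common factor across platforms.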