Not sure where the best place to post this is: I noticed that with Julia (x86) running under Rosetta translation, code that heavily uses linear algebra can be 7-10 times slower with the default OpenBLAS thread count (8) than with a single thread. It's hard to produce a minimal example, but a somewhat simple one is listed at the end of this post. Tested on Julia 1.7.0-rc3. My guess is that the process may run on the slow efficiency cores along with the fast performance cores, causing a lot of waiting/locking:
Looking into this further, it appears that BLAS sets its thread count to 8 by default. Setting it to 1 makes the code considerably faster:
Default (`BLAS.get_num_threads()` returns 8):
1.024 s (11007 allocations: 6.04 MiB)
whereas with the BLAS thread count set to 1 via `BLAS.set_num_threads(1)`, the code runs in
170.692 ms (11007 allocations: 6.04 MiB)
As a comparison, the native ARM (M1) 1.7 build gives
55.083 ms (11007 allocations: 6.04 MiB)
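For reference, the default-vs-single-thread gap can be checked with just `LinearAlgebra` and `BenchmarkTools`, without `ExponentialUtilities`. This is only a minimal sketch (the 17×17 size mirrors the example at the end of the post), not the full benchmark:

```julia
using BenchmarkTools, LinearAlgebra

# Remember the current OpenBLAS thread count so it can be restored.
saved = BLAS.get_num_threads()

a = rand(ComplexF64, 17, 17)
b = similar(a)

# Time a small complex matrix multiply with the default thread count...
t_default = @belapsed mul!($b, $a, $a)

# ...then again with a single BLAS thread.
BLAS.set_num_threads(1)
t_single = @belapsed mul!($b, $a, $a)

# Restore the original setting.
BLAS.set_num_threads(saved)

println("default ($saved threads): $(t_default * 1e6) μs")
println("single thread:            $(t_single * 1e6) μs")
```

At this matrix size the per-call work is tiny, so any cross-core synchronization overhead dominates the actual multiply.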
It looks like under the default 8 threads, two of them run on the slower efficiency cores (maybe causing the others to wait?), whereas in the single-threaded OpenBLAS case the peak is not very discernible.
And in VS Code, `@profview` shows that the majority of the time is spent in wait/lock (the tiny sliver at the rightmost is the actual linear algebra calculation).
Example:
```julia
using BenchmarkTools, LinearAlgebra
using ExponentialUtilities

function loop_ex(n, m)
    c = zeros(ComplexF64, n, n)
    cache = ExponentialUtilities.alloc_mem(c, ExpMethodHigham2005())
    for i = 1:m
        a = rand(ComplexF64, n, n)
        b = exponential!(a, ExpMethodHigham2005(a), cache)
        mul!(c, b, b, 1, 1)
    end
    c
end

@btime loop_ex(17, 1000)
```
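To see where the slowdown kicks in, one could sweep over BLAS thread counts and re-time the example. This is a hypothetical sketch (`sweep_blas_threads` is my own helper, and it assumes `loop_ex` from above is already defined):

```julia
using BenchmarkTools, LinearAlgebra

# Re-run the benchmark above at several BLAS thread counts,
# restoring the original setting afterwards.
function sweep_blas_threads(n, m; counts = (1, 2, 4, 8))
    saved = BLAS.get_num_threads()
    for t in counts
        BLAS.set_num_threads(t)
        elapsed = @belapsed loop_ex($n, $m)
        println("BLAS threads = $t: $(round(elapsed * 1000; digits = 1)) ms")
    end
    BLAS.set_num_threads(saved)
end

sweep_blas_threads(17, 100)
```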
Is there a way to force Julia to run only on the performance cores by setting the correct QoS? Should this be reported on GitHub? Thanks!
–
edit: might be relevant: pr