Is the Mac M1 slower with multiple threads than with a single thread?

I’ve run a simple piece of code on a Mac M1 (Pro, 8 GB RAM), with the latest version of Julia.

Threads.@threads for i in 1:1000 rand(1000,1000)/rand(1000,1000) end

1 core = 50 sec
2 cores = 62 sec
4 cores = 71 sec
8 cores = doesn’t work at all (too long a wait)

Is this result because there is no native version of Julia for the M1, or is there something else going on?

I was trying some things out, and for some reason the Rosetta 2 version is much faster than the native 1.7-DEV build. But it’s very likely that this is a memory-limited benchmark.

Is the native m1 version available already?

Can you give more information on “doesn’t work at all”? Do you mean it never returns, crashes Julia, crashes the computer, or returns an error (and if so, what error)?

My best guess is one of two things. Either you are running into a memory bandwidth issue (as @gbaraldi said): each thread is trying to access 3 different arrays, everything cannot fit in the CPU cache, so data is constantly moving between RAM and the CPU. Or you are running into a garbage collection issue, since each thread allocates at least 3 arrays per iteration and then frees them.

You could try this code:

using Random
using Base.Threads

function t(iter)
    d1 = Dict{Int, Array{Float64, 2}}()
    d2 = Dict{Int, Array{Float64, 2}}()
    d3 = Dict{Int, Array{Float64, 2}}()

    for id in 1:nthreads()
        d1[id] = Array{Float64, 2}(undef, 1000, 1000)
        d2[id] = Array{Float64, 2}(undef, 1000, 1000)
        d3[id] = Array{Float64, 2}(undef, 1000, 1000)
    end

    @threads for i in 1:iter
        id = threadid()
        Random.rand!(d1[id])
        Random.rand!(d2[id])
        broadcast!(/, d3[id], d1[id], d2[id])
    end
end

@time t(1000)

This basically does the computations you want without stressing the garbage collector. First it allocates the memory for each thread to use and saves it in a Dict. Then the main loop fills the d1 and d2 arrays with random values and performs the elementwise division, saving the result into the d3 array. Since all the arrays are kept alive throughout the entire run, no GC needs to happen; there is also a better chance of cache hits, since the arrays are reused.

It doesn’t seem like a memory problem.

I’ve run this code:
@time Threads.@threads for i in 1:10000 rand(100,100)/rand(100,100) end
1 core - 3.42 sec
2 cores - 25.6 sec
4 cores - 30.1 sec

Yes, it was too long to wait for the result.

I am sorry, I can’t run your code right now; I’ll try to do it over the weekend.

Note that / is matrix division, so you might be timing some big LAPACK routine (in C or Fortran) which is itself multi-threaded. Your code doesn’t run as written, and if I try to guess what you meant, I get times two orders of magnitude different [with a bad guess!], so it’s really unclear what you’re seeing.

@pixel27’s benchmark instead broadcasts with ./, which is handled by Julia itself. It seems to benefit from threads, although less than linearly.
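To make the distinction between the two operators concrete, here is a small illustrative snippet (not from the thread):

```julia
using LinearAlgebra

A = rand(4, 4); B = rand(4, 4)

X = A / B    # matrix "right division": solves X * B = A via a LAPACK factorization
Y = A ./ B   # elementwise division, a plain Julia broadcast

@assert X * B ≈ A                     # the solve reconstructs A
@assert Y[1, 1] == A[1, 1] / B[1, 1]  # broadcast divides entry by entry
```

The first line is what the opening post was timing; the second is what @pixel27’s broadcast! version times.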

If you’re trying to figure out whether your Julia is running natively:

julia> versioninfo()
Julia Version 1.7.0-DEV.1102
Commit a0241b9226 (2021-05-13 23:27 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.4.0)   # will say x86_64-apple under rosetta
  CPU: Apple M1

1. I am sorry, I misspelled it; there is just a simple for-loop.
2. Firstly, I just want to compare the M1 and Intel, nothing else. But I have found that multithreading performs very poorly on the M1.
3. On Intel, multithreading works as expected.

It’d be great if we could automatically limit ourselves to 4 threads on the M1; I’ve experienced the same issue of using >4 threads seriously regressing performance.
However, I haven’t gotten around to doing more serious testing, e.g. I’d think using @spawn with enough chunks should help performance at some point.
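For what it’s worth, a chunked @spawn version might look something like the sketch below; chunked_run and the chunk count are hypothetical names, and the workload is just the 100x100 example from earlier in the thread:

```julia
using Base.Threads

# Hypothetical sketch: split the iterations into explicit chunks and
# @spawn one task per chunk, letting the scheduler balance the tasks
# across however many threads are actually fast on this machine.
function chunked_run(n, nchunks)
    chunksize = cld(n, nchunks)
    tasks = map(1:chunksize:n) do lo
        Threads.@spawn begin
            for i in lo:min(lo + chunksize - 1, n)
                rand(100, 100) ./ rand(100, 100)  # elementwise, no BLAS involved
            end
        end
    end
    foreach(wait, tasks)
    return nothing
end

chunked_run(10000, 8)  # e.g. 8 chunks, regardless of nthreads()
```

With more chunks than threads, a slow (efficiency) core finishing late holds up only its last chunk rather than a fixed quarter or eighth of the whole loop.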

As mcabbott said, be sure to run using LinearAlgebra; BLAS.set_num_threads(1) before running the benchmark.

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> @time Threads.@threads for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
 24.880300 seconds (3.28 M allocations: 37.438 GiB, 3.44% gc time, 0.06% compilation time)

julia> versioninfo()
Julia Version 1.7.0-DEV.1088
Commit 6cebd28e66* (2021-05-11 14:04 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4

julia> BLAS.set_num_threads(4)

julia> @time for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
 26.165694 seconds (11.00 k allocations: 37.261 GiB, 3.40% gc time)

By restricting ourselves to 4 threads total (between both BLAS and Julia), these times are both much faster than those you reported in the opening post.

These times are good, especially as the M1 wasn’t even using Apple Accelerate, meaning it is missing out on Apple’s library, which uses their special matrix instructions.
For comparison, MKL with 4 threads on a system with AVX512:

julia> using MKL
[ Info: Precompiling MKL [33e6dc65-8f57-5167-99aa-e5a354878fb2]

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> @time Threads.@threads for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
 13.029377 seconds (3.28 M allocations: 37.439 GiB, 2.82% gc time, 0.12% compilation time)

julia> versioninfo()
Julia Version 1.7.0-DEV.1082
Commit 6420bd5d63* (2021-05-10 13:16 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 4

julia> BLAS.set_num_threads(4)

julia> @time for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
 16.962086 seconds (11.00 k allocations: 37.261 GiB, 2.62% gc time)

julia> using VectorizedRNG

julia> BLAS.set_num_threads(1)

julia> @time Threads.@threads for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
 12.629314 seconds (38.23 k allocations: 37.262 GiB, 1.90% gc time, 0.14% compilation time)

julia> @time Threads.@threads for i in 1:1000 rand(local_rng(),1000,1000)/rand(local_rng(),1000,1000) end
 11.429838 seconds (35.28 k allocations: 37.262 GiB, 1.64% gc time, 0.15% compilation time)

julia> Threads.nthreads()
4

Using OpenBLAS instead results in a time similar to the M1’s:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> Threads.nthreads()
4

julia> @time Threads.@threads for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
 19.849157 seconds (3.28 M allocations: 37.439 GiB, 1.93% gc time, 0.10% compilation time)

@staticfloat had a gist somewhere for getting the M1 to use Accelerate.
Would be great to try that again and have a fair comparison with MKL.


You can use BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate") to start forwarding to Accelerate, but it’s LP64, so you can’t use Julia’s native * to do GEMM (since that will expect ILP64, and will thus still dispatch to OpenBLAS), you need to write your own wrapper. The full gist has a demonstration.
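For illustration, a minimal LP64 GEMM wrapper might look like the sketch below. The dgemm_ symbol and argument order are the standard reference-BLAS interface, but the details here are an assumption of what such a wrapper could look like, not the gist’s actual code:

```julia
# Hypothetical sketch of an LP64 GEMM wrapper calling Accelerate directly.
# Accelerate's BLAS uses 32-bit integers (LP64), so every dimension
# argument must be passed as Int32 rather than Julia's default Int64.
const ACCELERATE = "/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"

function accelerate_gemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    m, k = size(A)
    n = size(B, 2)
    # Computes C = 1.0 * A * B + 0.0 * C, column-major, no transposes.
    ccall((:dgemm_, ACCELERATE), Cvoid,
          (Ref{UInt8}, Ref{UInt8}, Ref{Int32}, Ref{Int32}, Ref{Int32},
           Ref{Float64}, Ptr{Float64}, Ref{Int32}, Ptr{Float64}, Ref{Int32},
           Ref{Float64}, Ptr{Float64}, Ref{Int32}),
          'N', 'N', m, n, k, 1.0, A, m, B, k, 0.0, C, m)
    return C
end
```

This only runs on macOS, of course, and because it bypasses Julia’s ILP64 dispatch it never touches OpenBLAS.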


Great, thanks.
I created AppleAccelerateLinAlgWrapper.jl for testing purposes.

julia> # using Pkg; Pkg.add(url="https://github.com/chriselrod/AppleAccelerateLinAlgWrapper.jl")

julia> using AppleAccelerateLinAlgWrapper

julia> @time for i in 1:1000 AppleAccelerateLinAlgWrapper.rdiv!(rand(1000,1000),rand(1000,1000)) end
 19.473893 seconds (5.00 k allocations: 14.905 GiB, 1.39% gc time)

julia> @time for i in 1:1000 AppleAccelerateLinAlgWrapper.rdiv!(rand(1000,1000),rand(1000,1000)) end
 19.712946 seconds (5.00 k allocations: 14.905 GiB, 1.63% gc time)

This is faster than before, but performance suffers for 1000x1000 matrices.

The 4 Firestorm cores share 16 MiB of L2 cache (and there is no L3 cache). Two 1000x1000 Float64 matrices require about 15 MiB, so this is right at the limit.
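The arithmetic behind that estimate:

```julia
# Two 1000×1000 Float64 matrices, in MiB:
bytes = 2 * 1000 * 1000 * 8   # 16_000_000 bytes (8 bytes per Float64)
mib   = bytes / 2^20          # ≈ 15.26 MiB, against the 16 MiB shared L2
```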

FWIW, the x86 CPUs I’ve tried do not reach a point where performance starts to fall, even once the matrices are too large to fit in the L3 cache. This also holds for the 10980XE, which is faster per core over the range of sampled sizes (500,1000,4000), so it isn’t a throughput vs memory bandwidth problem.

Because Accelerate falls behind OpenBLAS by 4000x4000, this suggests it is at least partly an implementation/algorithmic problem.

GEMM Benchmarks
julia> M = K = N = 1000; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.026790 seconds (2 allocations: 7.629 MiB)

julia> BLAS.set_num_threads(4)

julia> @benchmark mul!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.556 ms (0.00% GC)
  median time:      11.578 ms (0.00% GC)
  mean time:        11.579 ms (0.00% GC)
  maximum time:     11.808 ms (0.00% GC)
  --------------
  samples:          432
  evals/sample:     1

julia> @benchmark AppleAccelerateLinAlgWrapper.gemm!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.734 ms (0.00% GC)
  median time:      8.327 ms (0.00% GC)
  mean time:        8.317 ms (0.00% GC)
  maximum time:     8.911 ms (0.00% GC)
  --------------
  samples:          601
  evals/sample:     1

julia> 2e-9M*K*N ./ (11.556e-3, 7.745e-3)
(173.07026652821048, 258.2311168495804)

julia> M = K = N = 500; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.005097 seconds (2 allocations: 1.907 MiB)

julia> @benchmark mul!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.544 ms (0.00% GC)
  median time:      1.546 ms (0.00% GC)
  mean time:        1.547 ms (0.00% GC)
  maximum time:     1.668 ms (0.00% GC)
  --------------
  samples:          3231
  evals/sample:     1

julia> @benchmark AppleAccelerateLinAlgWrapper.gemm!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     776.583 μs (0.00% GC)
  median time:      788.250 μs (0.00% GC)
  mean time:        789.392 μs (0.00% GC)
  maximum time:     982.458 μs (0.00% GC)
  --------------
  samples:          6325
  evals/sample:     1

julia> 2e-9M*K*N ./ (1.544e-3, 776.583e-6)
(161.91709844559588, 321.92309128579956)

julia> M = K = N = 4000; A = rand(M,K); B = rand(K,N); C0 = similar(A * B);

julia> @benchmark mul!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     751.998 ms (0.00% GC)
  median time:      753.514 ms (0.00% GC)
  mean time:        753.247 ms (0.00% GC)
  maximum time:     753.881 ms (0.00% GC)
  --------------
  samples:          7
  evals/sample:     1


julia> @benchmark AppleAccelerateLinAlgWrapper.gemm!($C0,$A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     913.586 ms (0.00% GC)
  median time:      996.914 ms (0.00% GC)
  mean time:        984.217 ms (0.00% GC)
  maximum time:     1.004 s (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1

julia> 2e-9M*K*N ./ (751.998e-3, 913.586e-3)
(170.2132186521773, 140.1072258112537)