It’d be great if we could automatically limit ourselves to 4 threads on the M1; I’ve experienced the same issue of using >4 threads seriously regressing performance.
However, I haven’t gotten around to doing more serious testing; for example, I’d think using @spawn
with enough chunks should help performance at some point.
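As a rough, untested sketch of that chunking idea (the chunk count of 20 is arbitrary, picked to exceed the thread count so the scheduler can load-balance):
julia> using LinearAlgebra; BLAS.set_num_threads(1)
julia> @time begin
           # 20 chunks of 50 solves each = the same 1000 solves as the benchmarks below
           tasks = [Threads.@spawn(for _ in 1:50 rand(1000,1000)/rand(1000,1000) end) for _ in 1:20]
           foreach(wait, tasks)
       end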
As mcabbot said, be sure to run using LinearAlgebra; BLAS.set_num_threads(1)
before running the benchmark.
julia> using LinearAlgebra
julia> BLAS.set_num_threads(1)
julia> @time Threads.@threads for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
24.880300 seconds (3.28 M allocations: 37.438 GiB, 3.44% gc time, 0.06% compilation time)
julia> versioninfo()
Julia Version 1.7.0-DEV.1088
Commit 6cebd28e66* (2021-05-11 14:04 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin20.3.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
JULIA_NUM_THREADS = 4
julia> BLAS.set_num_threads(4)
julia> @time for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
26.165694 seconds (11.00 k allocations: 37.261 GiB, 3.40% gc time)
By restricting ourselves to 4 threads in total (split between BLAS and Julia), both of these times are much faster than the ones you reported in the opening post.
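Concretely, “4 threads total” means either of these two configurations (Julia itself was started with 4 threads, per the JULIA_NUM_THREADS = 4 in versioninfo() above):
julia> BLAS.set_num_threads(1)  # Threads.@threads uses the 4 Julia threads; BLAS stays single-threaded
julia> BLAS.set_num_threads(4)  # plain serial loop; BLAS uses all 4 threads internally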
These times are good, especially as the M1 wasn’t even using Apple Accelerate, so it misses out on Apple’s library and its special matrix instructions.
For comparison, MKL with 4 threads on a system with AVX512:
julia> using MKL
[ Info: Precompiling MKL [33e6dc65-8f57-5167-99aa-e5a354878fb2]
julia> using LinearAlgebra
julia> BLAS.set_num_threads(1)
julia> @time Threads.@threads for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
13.029377 seconds (3.28 M allocations: 37.439 GiB, 2.82% gc time, 0.12% compilation time)
julia> versioninfo()
Julia Version 1.7.0-DEV.1082
Commit 6420bd5d63* (2021-05-10 13:16 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 4
julia> BLAS.set_num_threads(4)
julia> @time for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
16.962086 seconds (11.00 k allocations: 37.261 GiB, 2.62% gc time)
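VectorizedRNG’s thread-local local_rng() generator shaves off a little more time: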
julia> using VectorizedRNG
julia> BLAS.set_num_threads(1)
julia> @time Threads.@threads for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
12.629314 seconds (38.23 k allocations: 37.262 GiB, 1.90% gc time, 0.14% compilation time)
julia> @time Threads.@threads for i in 1:1000 rand(local_rng(),1000,1000)/rand(local_rng(),1000,1000) end
11.429838 seconds (35.28 k allocations: 37.262 GiB, 1.64% gc time, 0.15% compilation time)
julia> Threads.nthreads()
4
Using OpenBLAS instead results in a time similar to the M1’s:
julia> using LinearAlgebra
julia> BLAS.set_num_threads(1)
julia> Threads.nthreads()
4
julia> @time Threads.@threads for i in 1:1000 rand(1000,1000)/rand(1000,1000) end
19.849157 seconds (3.28 M allocations: 37.439 GiB, 1.93% gc time, 0.10% compilation time)
@staticfloat had a gist somewhere for getting the M1 to use Accelerate.
It would be great to try that again for a fair comparison with MKL.
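For anyone who wants to try: on Julia builds using libblastrampoline for BLAS dispatch (as 1.7 does), I’d expect something along these lines to forward BLAS calls to Accelerate. This is an untested sketch, not the gist itself: the framework path is just the standard macOS location, and note that Accelerate has historically only exposed an LP64 (32-bit integer) interface while Julia defaults to ILP64, which is part of what the gist had to work around.
julia> using LinearAlgebra
julia> BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"; clear=true)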