I usually run my computations on a Linux workstation via SSH from my MacBook. However, the specs of the recent M4 MacBook Pro would seem to promise rather extreme performance.
But just testing out basic matrix-vector multiplication (which typically dominates my numerics) on my current M1, I'm seeing an extreme, order-of-magnitude slowdown.
Here is the benchmark on the Linux workstation:
> JULIA_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 julia
julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, cascadelake)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
JULIA_NUM_THREADS = 1
LD_LIBRARY_PATH = /home/goerz/.local/lib
JULIA_PKG_PRESERVE_TIERED_INSTALLED = true
julia> using Pkg
julia> Pkg.status()
Status `~/.julia/environments/v1.11/Project.toml`
[6e4b80f9] BenchmarkTools v1.5.0
[31a5f54b] Debugger v0.7.10
⌃ [7073ff75] IJulia v1.25.0
[5903a43b] Infiltrator v1.8.3
[c3a54625] JET v0.9.12
⌃ [98e50ef6] JuliaFormatter v1.0.60
⌃ [295af30f] Revise v3.6.0
Info Packages marked with ⌃ have new versions available and may be upgradable.
julia> using LinearAlgebra
julia> N = 100;
julia> v = rand(N) + rand(N) * 1im;
julia> H = rand(N, N);
julia> v2 = similar(v);
julia> using BenchmarkTools
julia> @benchmark mul!($v2, $H, $v)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.408 μs … 5.375 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.420 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.439 μs ± 115.613 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
2.41 μs Histogram: log(frequency) by time 2.9 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Now, on the M1 MacBook:
> JULIA_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 julia
julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 10 × Apple M1 Pro
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
JULIA_EXCLUSIVE = 1
JULIA_NUM_THREADS = 1
JULIA_PKG_PRESERVE_TIERED_INSTALLED = true
julia> using Pkg
julia> Pkg.status()
Status `~/.julia/environments/v1.11/Project.toml`
[6e4b80f9] BenchmarkTools v1.5.0
⌃ [31a5f54b] Debugger v0.7.8
⌃ [7073ff75] IJulia v1.24.2
⌃ [5903a43b] Infiltrator v1.7.0
⌃ [c3a54625] JET v0.9.10
[afaeafb7] MuxDisplay v0.1.0-dev `../../../Documents/Programming/MuxDisplay.jl`
⌃ [14b8a8f1] PkgTemplates v0.7.48
⌃ [295af30f] Revise v3.6.0
⌃ [1e6cf692] TestEnv v1.101.1
Info Packages marked with ⌃ have new versions available and may be upgradable.
julia> using LinearAlgebra
julia> N = 100;
julia> v = rand(N) + rand(N) * 1im;
julia> H = rand(N, N);
julia> v2 = similar(v);
julia> using BenchmarkTools
julia> @benchmark mul!($v2, $H, $v)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 37.666 μs … 124.125 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 38.958 μs ┊ GC (median): 0.00%
Time (mean ± σ): 39.389 μs ± 3.163 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
37.7 μs Histogram: log(frequency) by time 57.1 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
The runtime is more than an order of magnitude higher (a median of 38.958 μs versus 2.420 μs, roughly 16×), and that seems to translate directly into the runtime of higher-level benchmarks.
What gives? Am I missing something fundamental on the macOS side?
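For anyone who wants to reproduce or narrow this down, here is a minimal sketch of two checks that might be informative: which BLAS backend Julia has loaded, and whether the mixed case (real `H`, complex `v`) behaves differently from an all-complex multiplication. The names `Hc` and `v2c` are my own additions for the comparison and are not part of the benchmark above. I have not verified that either check explains the gap, so treat this as a starting point, not a diagnosis.

```julia
using LinearAlgebra
using BenchmarkTools

# Report the BLAS/LAPACK libraries currently loaded and the BLAS thread count
println(BLAS.get_config())
println("BLAS threads: ", BLAS.get_num_threads())

N = 100
v = rand(N) + rand(N) * 1im   # ComplexF64 vector, as in the benchmark above
H = rand(N, N)                # Float64 matrix, as in the benchmark above
v2 = similar(v)

# Hypothetical comparison case: the same matrix converted to ComplexF64,
# so that matrix and vector share one element type
Hc = complex.(H)
v2c = similar(v)

# @belapsed returns the minimum observed time in seconds
t_mixed = @belapsed mul!($v2, $H, $v)
t_complex = @belapsed mul!($v2c, $Hc, $v)

println("mixed real/complex: ", t_mixed * 1e6, " μs")
println("all complex:        ", t_complex * 1e6, " μs")
```

If the two timings differ a lot on one machine but not the other, that would suggest the mixed-type code path is the thing to look at. If both are slow on the M1, the BLAS backend itself would be the more likely suspect.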