LinearAlgebra.mul! for complex vectors very slow on Apple Silicon

I usually run my computations on a Linux workstation via SSH from my MacBook. However, the specs of the recent M4 MacBook Pro would seem to promise rather extreme performance.

But just testing basic matrix-vector multiplication (which typically dominates my numerics) on my current M1, I’m seeing an extreme, order-of-magnitude slowdown.

Here is the benchmark on the Linux workstation:

> JULIA_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 julia

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 Γ— Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, cascadelake)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
  JULIA_NUM_THREADS = 1
  LD_LIBRARY_PATH = /home/goerz/.local/lib
  JULIA_PKG_PRESERVE_TIERED_INSTALLED = true

julia> using Pkg

julia> Pkg.status()
Status `~/.julia/environments/v1.11/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [31a5f54b] Debugger v0.7.10
βŒƒ [7073ff75] IJulia v1.25.0
  [5903a43b] Infiltrator v1.8.3
  [c3a54625] JET v0.9.12
βŒƒ [98e50ef6] JuliaFormatter v1.0.60
βŒƒ [295af30f] Revise v3.6.0
Info Packages marked with βŒƒ have new versions available and may be upgradable.

julia> using LinearAlgebra

julia> N = 100;

julia> v = rand(N) + rand(N) * 1im;

julia> H = rand(N, N);

julia> v2 = similar(v);

julia> using BenchmarkTools

julia> @benchmark mul!($v2, $H, $v)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.408 ΞΌs …   5.375 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.420 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.439 ΞΌs Β± 115.613 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–…β–ˆβ–„β–             ▂▁                                         ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–β–β–β–ƒβ–β–β–β–β–β–β–β–…β–ˆβ–ˆβ–„β–…β–…β–…β–β–ƒβ–β–ƒβ–„β–ƒβ–β–β–β–ƒβ–β–ƒβ–…β–…β–…β–…β–„β–…β–…β–„β–ƒβ–…β–‡β–ˆβ–…β–†β–…β–„β–…β–…β–„β–…β–„β–…β–„β–„ β–ˆ
  2.41 ΞΌs      Histogram: log(frequency) by time       2.9 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, on the M1 Macbook:

> JULIA_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 julia

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 Γ— Apple M1 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_EXCLUSIVE = 1
  JULIA_NUM_THREADS = 1
  JULIA_PKG_PRESERVE_TIERED_INSTALLED = true

julia> using Pkg

julia> Pkg.status()
Status `~/.julia/environments/v1.11/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
βŒƒ [31a5f54b] Debugger v0.7.8
βŒƒ [7073ff75] IJulia v1.24.2
βŒƒ [5903a43b] Infiltrator v1.7.0
βŒƒ [c3a54625] JET v0.9.10
  [afaeafb7] MuxDisplay v0.1.0-dev `../../../Documents/Programming/MuxDisplay.jl`
βŒƒ [14b8a8f1] PkgTemplates v0.7.48
βŒƒ [295af30f] Revise v3.6.0
βŒƒ [1e6cf692] TestEnv v1.101.1
Info Packages marked with βŒƒ have new versions available and may be upgradable.

julia> using LinearAlgebra

julia> N = 100;

julia> v = rand(N) + rand(N) * 1im;

julia> H = rand(N, N);

julia> v2 = similar(v);

julia> using BenchmarkTools

julia> @benchmark mul!($v2, $H, $v)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.666 ΞΌs … 124.125 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     38.958 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   39.389 ΞΌs Β±   3.163 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‡β–β–ƒβ–ˆβ–†β–β–‚β–‡β–„β–                                                   β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–†β–†β–…β–„β–„β–„β–„β–†β–…β–„β–„β–…β–ƒβ–„β–…β–„β–ƒβ–β–†β–…β–„β–β–„β–…β–„β–β–„β–ƒβ–„β–„β–β–ƒβ–ƒβ–ƒβ–β–ƒβ–„β–„β–„β–ƒβ–„β–„β–ƒβ–…β–„β–†β–‡ β–ˆ
  37.7 ΞΌs       Histogram: log(frequency) by time      57.1 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The runtime is more than an order of magnitude higher, and that seems to translate directly into the runtime of higher-level benchmarks.

What gives? Am I missing something fundamental on the macOS side?

What happens if you use AppleAccelerate.jl for BLAS?


I can reproduce.
It seems to be related to the complex element type (the time is 1.4 microseconds for the Float64 case).
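
For reference, here is a minimal sketch of that real-vs-complex comparison (same `N = 100` setup as in the original post; the timings in the comments are the ones reported in this thread, not guarantees):

```julia
using LinearAlgebra, BenchmarkTools

N = 100
H = rand(N, N)    # real operator, as in the original benchmark

v_re = rand(N);                 v2_re = similar(v_re)  # Float64 vectors (fast path)
v_im = rand(N) + rand(N) * 1im; v2_im = similar(v_im)  # ComplexF64 vectors (slow path)

@btime mul!($v2_re, $H, $v_re)  # ~1.4 ΞΌs on the M1, per the report above
@btime mul!($v2_im, $H, $v_im)  # ~38 ΞΌs on the M1, per the report above
```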

AppleAccelerate.jl does indeed accelerate.

The complex case on M1:

julia> @benchmark mul!($v2, $H, $v)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.625 ΞΌs … 59.125 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     37.750 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   37.979 ΞΌs Β±  1.240 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‡β–ˆβ–‚                    ▁▁                                   ▁
  β–ˆβ–ˆβ–ˆβ–†β–β–β–β–„β–β–β–β–ƒβ–β–β–β–ƒβ–β–β–ƒβ–„β–„β–β–β–ˆβ–ˆβ–…β–„β–ƒβ–…β–ƒβ–β–ƒβ–ƒβ–ƒβ–ƒβ–„β–β–ƒβ–β–β–ƒβ–†β–†β–„β–β–β–„β–β–…β–…β–„β–„β–…β–‡β–…β–†β–…β–†β–‡ β–ˆ
  37.6 ΞΌs      Histogram: log(frequency) by time      44.6 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> using AppleAccelerate

julia> @benchmark mul!($v2, $H, $v)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.667 ΞΌs …  3.646 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.679 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.686 ΞΌs Β± 72.234 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–ƒ    β–‡    β–ˆ     β–†    β–ƒ    β–ƒ    β–‚                      β–‚
  β–„β–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–β–ˆβ–β–β–β–β–†β–β–β–β–β–„β–β–β–β–β–ƒ β–ˆ
  1.67 ΞΌs      Histogram: log(frequency) by time     1.71 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Yeah, unfortunately state vectors in quantum mechanics are complex (while operators can be real-valued or complex)…

Indeed, and that speedup translates to the upstream benchmarks as well. Small caveat: it seems I have to set the environment variable VECLIB_MAXIMUM_THREADS=1 to ensure that no multi-threading happens inside BLAS. But the speedup still holds up, and the total performance is now comparable to the Linux workstation (up to the expected differences in baseline clock speed).
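
For the record, the full single-threaded launch line with that extra variable would then look something like this (a sketch; the other variables are the ones from the original post):

```shell
VECLIB_MAXIMUM_THREADS=1 OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 \
  MKL_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 JULIA_NUM_THREADS=1 julia
```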

An interesting observation: if htop and similar tools are to be believed on macOS, even though I can run the process at a constant 100% CPU usage, that usage does not at all seem to be pinned to a single processor on the MacBook. On Linux, I’m seeing the process occupy a single core (and that’s without using specialized tools like ThreadPinning). I suppose I shouldn’t care as long as the wallclock time is fine. And thread pinning isn’t even possible on macOS, right?

More than an order of magnitude difference for BLAS operations between vanilla LinearAlgebra and loading a platform-specific accelerator seems extreme, though. Should this be considered a bug? If so, where should I report it? The main Julia issue tracker, probably?

For comparison, whether or not I load MKL on Linux makes a far smaller difference.
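
For anyone comparing the two setups: which BLAS is actually loaded can be checked via libblastrampoline’s introspection (this is the standard LinearAlgebra API, shown here as a sketch):

```julia
using LinearAlgebra

# Lists the BLAS/LAPACK libraries currently registered with
# libblastrampoline; after `using AppleAccelerate` (or MKL), the
# accelerated library should appear at the front of the list.
BLAS.get_config()
```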

If using AppleAccelerate explicitly is going to have to be a requirement for usable performance, I’m going to have to think about how to handle that in my library code.

Is there a possibility that this kind of acceleration can be automatic? That is, Julia could use accelerated linear algebra just by virtue of me running the arm64-apple version of Julia? That would be ideal, of course, since I’m not overly enthusiastic about the prospect of having to add platform-specific β€œprecompiler-statements” to my library, or to benchmarking scripts.
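
If it does have to stay explicit, one hypothetical pattern for library or benchmark code (a sketch, not an established convention) would be a platform-conditional load:

```julia
# Sketch: pull in Accelerate only on Apple Silicon. This makes
# AppleAccelerate.jl a hard, platform-specific dependency, which is
# exactly the kind of special-casing I would rather avoid.
@static if Sys.isapple() && Sys.ARCH === :aarch64
    using AppleAccelerate
end
```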

The issue

seems to indicate that this might be a possibility.

Are there other relevant open issues that I didn’t find?


I don’t know much about the internals of AA, but I seem to remember that it may rely on a specialized part of the chip (the AMX coprocessor?), at least for gemm.

A quick workaround could be to convert your real matrix operator to a complex one, which seems to be performant even without AA:

julia> Hi = Matrix{ComplexF64}(H);

julia> @benchmark mul!($v2, $Hi, $v)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.010 ΞΌs …   7.208 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.198 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.218 ΞΌs Β± 137.120 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

          β–β–ƒβ–„β–„β–ˆβ–‡β–‚β–„β–ƒ                                            
  β–‚β–‚β–‚β–‚β–ƒβ–„β–…β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–†β–…β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  3.01 ΞΌs         Histogram: frequency by time        3.85 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.