LinearAlgebra.mul! for complex vectors very slow on Apple Silicon

I usually run my computations on a Linux workstation via SSH from my MacBook. However, the specs of the recent M4 MacBook Pro would seem to promise rather extreme performance.

But just testing basic matrix-vector multiplication (which typically dominates my numerics) on my current M1, I’m seeing an extreme, order-of-magnitude slowdown.

Here is the benchmark on the Linux workstation:

> JULIA_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 julia

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 Γ— Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, cascadelake)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
  JULIA_NUM_THREADS = 1
  LD_LIBRARY_PATH = /home/goerz/.local/lib
  JULIA_PKG_PRESERVE_TIERED_INSTALLED = true

julia> using Pkg

julia> Pkg.status()
Status `~/.julia/environments/v1.11/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [31a5f54b] Debugger v0.7.10
βŒƒ [7073ff75] IJulia v1.25.0
  [5903a43b] Infiltrator v1.8.3
  [c3a54625] JET v0.9.12
βŒƒ [98e50ef6] JuliaFormatter v1.0.60
βŒƒ [295af30f] Revise v3.6.0
Info Packages marked with βŒƒ have new versions available and may be upgradable.

julia> using LinearAlgebra

julia> N = 100;

julia> v = rand(N) + rand(N) * 1im;

julia> H = rand(N, N);

julia> v2 = similar(v);

julia> using BenchmarkTools

julia> @benchmark mul!($v2, $H, $v)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.408 ΞΌs …   5.375 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.420 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.439 ΞΌs Β± 115.613 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–…β–ˆβ–„β–             ▂▁                                         ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–β–β–β–ƒβ–β–β–β–β–β–β–β–…β–ˆβ–ˆβ–„β–…β–…β–…β–β–ƒβ–β–ƒβ–„β–ƒβ–β–β–β–ƒβ–β–ƒβ–…β–…β–…β–…β–„β–…β–…β–„β–ƒβ–…β–‡β–ˆβ–…β–†β–…β–„β–…β–…β–„β–…β–„β–…β–„β–„ β–ˆ
  2.41 ΞΌs      Histogram: log(frequency) by time       2.9 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, on the M1 Macbook:

> JULIA_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 julia

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 Γ— Apple M1 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_EXCLUSIVE = 1
  JULIA_NUM_THREADS = 1
  JULIA_PKG_PRESERVE_TIERED_INSTALLED = true

julia> using Pkg

julia> Pkg.status()
Status `~/.julia/environments/v1.11/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
βŒƒ [31a5f54b] Debugger v0.7.8
βŒƒ [7073ff75] IJulia v1.24.2
βŒƒ [5903a43b] Infiltrator v1.7.0
βŒƒ [c3a54625] JET v0.9.10
  [afaeafb7] MuxDisplay v0.1.0-dev `../../../Documents/Programming/MuxDisplay.jl`
βŒƒ [14b8a8f1] PkgTemplates v0.7.48
βŒƒ [295af30f] Revise v3.6.0
βŒƒ [1e6cf692] TestEnv v1.101.1
Info Packages marked with βŒƒ have new versions available and may be upgradable.

julia> using LinearAlgebra

julia> N = 100;

julia> v = rand(N) + rand(N) * 1im;

julia> H = rand(N, N);

julia> v2 = similar(v);

julia> using BenchmarkTools

julia> @benchmark mul!($v2, $H, $v)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.666 ΞΌs … 124.125 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     38.958 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   39.389 ΞΌs Β±   3.163 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‡β–β–ƒβ–ˆβ–†β–β–‚β–‡β–„β–                                                   β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–†β–†β–…β–„β–„β–„β–„β–†β–…β–„β–„β–…β–ƒβ–„β–…β–„β–ƒβ–β–†β–…β–„β–β–„β–…β–„β–β–„β–ƒβ–„β–„β–β–ƒβ–ƒβ–ƒβ–β–ƒβ–„β–„β–„β–ƒβ–„β–„β–ƒβ–…β–„β–†β–‡ β–ˆ
  37.7 ΞΌs       Histogram: log(frequency) by time      57.1 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The runtime is more than an order of magnitude higher, and that seems to translate directly into the runtime of higher-level benchmarks.

What gives? Am I missing something fundamental on the macOS side?

What happens if you use AppleAccelerate.jl for BLAS?


I can reproduce.
It seems to be related to the complex element type (the time is 1.4 microseconds for the Float64 case).
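
For reference, here is a minimal sketch of that real-vs-complex comparison (same `N = 100` setup as in the original post; the timings in the comments are the ones reported in this thread, not guarantees):

```julia
using LinearAlgebra, BenchmarkTools

N = 100
H = rand(N, N)    # real operator, as in the original benchmark

v_re = rand(N);                 v2_re = similar(v_re)  # Float64 vectors (fast path)
v_im = rand(N) + rand(N) * 1im; v2_im = similar(v_im)  # ComplexF64 vectors (slow path)

@btime mul!($v2_re, $H, $v_re)  # ~1.4 ΞΌs on the M1, per the report above
@btime mul!($v2_im, $H, $v_im)  # ~38 ΞΌs on the M1, per the report above
```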

AppleAccelerate.jl does indeed accelerate.

The complex case on M1:

julia> @benchmark mul!($v2, $H, $v)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.625 ΞΌs … 59.125 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     37.750 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   37.979 ΞΌs Β±  1.240 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‡β–ˆβ–‚                    ▁▁                                   ▁
  β–ˆβ–ˆβ–ˆβ–†β–β–β–β–„β–β–β–β–ƒβ–β–β–β–ƒβ–β–β–ƒβ–„β–„β–β–β–ˆβ–ˆβ–…β–„β–ƒβ–…β–ƒβ–β–ƒβ–ƒβ–ƒβ–ƒβ–„β–β–ƒβ–β–β–ƒβ–†β–†β–„β–β–β–„β–β–…β–…β–„β–„β–…β–‡β–…β–†β–…β–†β–‡ β–ˆ
  37.6 ΞΌs      Histogram: log(frequency) by time      44.6 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> using AppleAccelerate

julia> @benchmark mul!($v2, $H, $v)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.667 ΞΌs …  3.646 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.679 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.686 ΞΌs Β± 72.234 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–ƒ    β–‡    β–ˆ     β–†    β–ƒ    β–ƒ    β–‚                      β–‚
  β–„β–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–β–ˆβ–β–β–β–β–†β–β–β–β–β–„β–β–β–β–β–ƒ β–ˆ
  1.67 ΞΌs      Histogram: log(frequency) by time     1.71 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Yeah, unfortunately state vectors in quantum mechanics are complex (while operators can be real-valued or complex)…

Indeed, and that speedup translates to the upstream benchmarks as well. Small caveat: it seems I have to set the environment variable VECLIB_MAXIMUM_THREADS=1 to ensure that no multi-threading happens inside BLAS. But the speedup still holds up, and the total performance is now comparable to the Linux workstation (up to the expected differences in baseline clock speed).
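
For the record, the full single-threaded launch line with that extra variable would then look something like this (a sketch; the other variables are the ones from the original post):

```shell
VECLIB_MAXIMUM_THREADS=1 OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 \
  MKL_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 JULIA_NUM_THREADS=1 julia
```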

An interesting observation: if htop and similar tools are to be believed on macOS, even though I can run the process at a constant 100% CPU usage, that usage does not at all seem to be pinned to a single processor on the MacBook. On Linux, I’m seeing the process occupy a single core (and that’s without using specialized tools like ThreadPinning). I suppose I shouldn’t care as long as the wallclock time is fine. And thread pinning isn’t even possible on macOS, right?

More than an order of magnitude difference for BLAS operations between vanilla LinearAlgebra and loading a platform-specific accelerator seems extreme, though. Should this be considered a bug? If so, where should I report it? The main Julia issue tracker, probably?

For comparison, whether or not I load MKL on Linux makes a far smaller difference.
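
For anyone comparing the two setups: which BLAS is actually loaded can be checked via libblastrampoline’s introspection (this is the standard LinearAlgebra API, shown here as a sketch):

```julia
using LinearAlgebra

# Lists the BLAS/LAPACK libraries currently registered with
# libblastrampoline; after `using AppleAccelerate` (or MKL), the
# accelerated library should appear at the front of the list.
BLAS.get_config()
```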

If using AppleAccelerate explicitly is going to have to be a requirement for usable performance, I’m going to have to think about how to handle that in my library code.

Is there a possibility that this kind of acceleration can be automatic? That is, Julia could use accelerated linear algebra just by virtue of me running the arm64-apple version of Julia? That would be ideal, of course, since I’m not overly enthusiastic about the prospect of having to add platform-specific β€œprecompiler-statements” to my library, or to benchmarking scripts.
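
If it does have to stay explicit, one hypothetical pattern for library or benchmark code (a sketch, not an established convention) would be a platform-conditional load:

```julia
# Sketch: pull in Accelerate only on Apple Silicon. This makes
# AppleAccelerate.jl a hard, platform-specific dependency, which is
# exactly the kind of special-casing I would rather avoid.
@static if Sys.isapple() && Sys.ARCH === :aarch64
    using AppleAccelerate
end
```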

The issue

seems to indicate that this might be a possibility.

Are there other relevant open issues that I didn’t find?


I don’t know much about the internals of AA, but I seem to remember that it may rely on a specialized part of the chip (the AMX coprocessor?), at least for gemm.

A quick workaround could be to convert your real matrix operator to a complex one, which seems to be performant even without AA:

julia> Hi = Matrix{ComplexF64}(H);

julia> @benchmark mul!($v2, $Hi, $v)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.010 ΞΌs …   7.208 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.198 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.218 ΞΌs Β± 137.120 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

          β–β–ƒβ–„β–„β–ˆβ–‡β–‚β–„β–ƒ                                            
  β–‚β–‚β–‚β–‚β–ƒβ–„β–…β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–†β–…β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  3.01 ΞΌs         Histogram: frequency by time        3.85 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.