I recently got a Mac Studio M4 Max, and I am having fun exploring its compute capabilities in Julia. In particular, I wanted to see how well it could take advantage of its substantial memory bandwidth (546 GB/s) in linear algebra.
Despite some claims that column major vs row major shouldn’t have a significant impact (What are the pros and cons of row/column major ordering?), my testing showed a substantial performance improvement from using row major order:
using LinearAlgebra, BenchmarkTools

N = 16384
a = randn(N, N); b = randn(N); c = similar(b);

# Multi-threaded BLAS (default)
@btime mul!($c, $a, $b);
@btime mul!($c, $transpose(a), $b);

# Repeat for a single thread
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($c, $a, $b);
@btime mul!($c, $transpose(a), $b);
You should know that OpenBLAS, the default BLAS shipped with Julia, isn’t especially well optimized for Apple Silicon yet. Work is ongoing, but the instruction set for the matrix coprocessor was only publicly documented starting with the M4, so the effort to take full advantage of these processors only started recently.
You can try AppleAccelerate.jl to use Apple’s own BLAS instead. However, that seems to run exclusively on the AMX coprocessors rather than the P-cores, so the performance scaling may differ from non-linear-algebra workloads. My M4 Pro seems to have 3 AMX coprocessors, but a single matmul only uses at most 2.
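In case it helps, here’s a minimal sketch of how to check which backend libblastrampoline is actually forwarding to, before and after loading the package (assuming Julia ≥ 1.7 and a current AppleAccelerate.jl):

using LinearAlgebra

# Default backend: the OpenBLAS bundled with Julia
BLAS.lbt_get_config()

using AppleAccelerate   # swaps the BLAS/LAPACK backend to Apple's Accelerate

# The config should now list Accelerate as the forwarded library
BLAS.lbt_get_config()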
Just a note that you’re only interpolating the transpose function here. You probably wanted $(transpose(a)). (This is obviously irrelevant for this benchmark: the number crunching swamps that single load of an untyped global by many orders of magnitude.)
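Concretely, with the same a, b, c as in the first post, the difference is just which part of the expression gets interpolated:

# Only the function `transpose` is interpolated; `a` is still read as an untyped global
@btime mul!($c, $transpose(a), $b);

# Interpolates the already-constructed Transpose wrapper, avoiding the global lookup
@btime mul!($c, $(transpose(a)), $b);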
For the row major case, the performance with AppleAccelerate is actually slightly worse than with OpenBLAS:
julia> using AppleAccelerate
julia> @btime mul!($c, $a, $b);
10.446 ms (0 allocations: 0 bytes)
julia> @btime mul!($c, $a, $b);
10.447 ms (0 allocations: 0 bytes)
julia> @btime mul!($c, $(transpose(a)), $b);
8.222 ms (0 allocations: 0 bytes)
julia> @btime mul!($c, $(transpose(a)), $b);
8.209 ms (0 allocations: 0 bytes)
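As a back-of-the-envelope check against the 546 GB/s figure from the first post: a matrix-vector product at this size is essentially bound by streaming the ~2.1 GB matrix once, so these timings correspond to roughly 205 GB/s (column major) and 260 GB/s (row major). A minimal sketch of that arithmetic, using the median times printed above:

N = 16384
bytes = N^2 * sizeof(Float64)   # ≈ 2.15 GB for the matrix; the vectors are negligible
for (label, t) in (("column major", 10.45e-3), ("row major", 8.21e-3))
    println(label, ": ", round(bytes / t / 1e9; digits = 1), " GB/s")
end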
I don’t think there are more than 2 AMX engines on the M4 Max. In my testing, by adjusting the VECLIB_MAXIMUM_THREADS variable I could only get it to use 1 or 2 AMX engines. This was corroborated by Activity Monitor, which showed either 1 or 2 P-cores fully loaded, respectively. I asked a friend to run the same test on an M3 Ultra, and there we observed 4 AMX engines in use, which is what we expected.
Did you try launching matmuls on multiple processes concurrently? That’s how I concluded that my M4 Pro has 3, and I’d be surprised if the M4 Max has fewer. But a single matmul definitely only uses 2, so you have to launch multiple at the same time from different processes (maybe different threads within the same process works too, I haven’t tried that).
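If anyone wants to try the threads-within-one-process variant, here’s a rough, untested sketch (assuming Julia is started with -t 3 and, to mirror the process experiment, VECLIB_MAXIMUM_THREADS=1 set before launch; whether Accelerate actually spreads these calls across separate AMX engines is exactly the open question):

using AppleAccelerate, LinearAlgebra

# Three independent matmuls launched concurrently from one process
N = 8192
As = [randn(N, N) for _ in 1:3]
Bs = [randn(N, N) for _ in 1:3]
Cs = [similar(A) for A in As]

t = @elapsed begin
    tasks = [Threads.@spawn mul!(Cs[i], As[i], Bs[i]) for i in 1:3]
    foreach(wait, tasks)
end
println("aggregate GFLOP/s: ", 3 * 2 * N^3 / t / 1e9)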
Here’s what I’m doing from the fish shell (I guess bash or zsh has slightly different loop syntax, but you get the idea):
for i in (seq 1 3)
VECLIB_MAXIMUM_THREADS=1 julia --startup-file=no -e "using AppleAccelerate, InteractiveUtils; println(peakflops(2^13))" &
end
For 1, 2, and 3 concurrent processes, the printed numbers are similar: in the neighborhood of 3.9e11. When I increase to 4 the numbers start to drop.
EDIT: Hm, but there’s a substantial delay between the printing of the second and third numbers, so the above benchmark must be flawed. I suppose that to get proper concurrent benchmarks you have to restrict each process to a single trial; otherwise one process’s last trial may run after the others have returned. Taking this into account, you can clearly see the numbers drop when the third process is added:
❯ for i in (seq 1 2)
VECLIB_MAXIMUM_THREADS=1 julia --startup-file=no -e "using AppleAccelerate, InteractiveUtils; println(peakflops(2^13; ntrials=1))" &
end
3.545642116407109e11
3.5424043606968884e11
❯ for i in (seq 1 3)
VECLIB_MAXIMUM_THREADS=1 julia --startup-file=no -e "using AppleAccelerate, InteractiveUtils; println(peakflops(2^13; ntrials=1))" &
end
3.574504262199497e11
3.557026872100763e11
2.0444442089157233e11
So I stand corrected. The M4 Pro seems to have two AMX engines.