I recently got a Mac Studio M4 Max, and I am having fun exploring its compute capabilities in Julia. In particular, I wanted to see how well it could take advantage of its substantial memory bandwidth (546 GB/s) in linear algebra.
Despite some claims that column major vs row major shouldn’t have a significant impact (What are the pros and cons of row/column major ordering?), my testing showed a substantial performance improvement from using row major order:
using LinearAlgebra, BenchmarkTools

N = 16384
a = randn(N, N); b = randn(N); c = similar(b);

# Multi-threaded BLAS (default)
@btime mul!($c, $a, $b);
@btime mul!($c, $transpose(a), $b);

# Repeat for a single thread
LinearAlgebra.BLAS.set_num_threads(1)
@btime mul!($c, $a, $b);
@btime mul!($c, $transpose(a), $b);
You should know that OpenBLAS, the default BLAS shipped with Julia, isn’t especially well optimized for Apple Silicon yet. Work is ongoing, but the instruction set for the matrix coprocessor was only publicly documented starting with the M4, so the effort to take full advantage of these processors only started recently.
You can try AppleAccelerate.jl to use Apple’s own BLAS instead. However, that seems to run exclusively on the AMX coprocessors rather than the P-cores, so the performance scaling may differ from non-linear-algebra workloads. My M4 Pro seems to have 3 AMX coprocessors, but a single matmul only uses at most 2.
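In case it helps, here’s a minimal sketch of how to check which backend libblastrampoline is actually forwarding to, before and after loading the package (assuming Julia ≥ 1.7 and a current AppleAccelerate.jl):

using LinearAlgebra

# Default backend: the OpenBLAS bundled with Julia
BLAS.lbt_get_config()

using AppleAccelerate   # swaps the BLAS/LAPACK backend to Apple's Accelerate

# The config should now list Accelerate as the forwarded library
BLAS.lbt_get_config()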
Just a note that you’re only interpolating the transpose function here. You probably wanted $(transpose(a)). (This is obviously irrelevant for this benchmark: the number crunching swamps that single load of an untyped global by many orders of magnitude.)
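Concretely, with the same a, b, c as in the first post, the difference is just which part of the expression gets interpolated:

# Only the function `transpose` is interpolated; `a` is still read as an untyped global
@btime mul!($c, $transpose(a), $b);

# Interpolates the already-constructed Transpose wrapper, avoiding the global lookup
@btime mul!($c, $(transpose(a)), $b);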
For the row major case, the performance with AppleAccelerate is actually slightly worse than with OpenBLAS:
julia> using AppleAccelerate
julia> @btime mul!($c, $a, $b);
10.446 ms (0 allocations: 0 bytes)
julia> @btime mul!($c, $a, $b);
10.447 ms (0 allocations: 0 bytes)
julia> @btime mul!($c, $(transpose(a)), $b);
8.222 ms (0 allocations: 0 bytes)
julia> @btime mul!($c, $(transpose(a)), $b);
8.209 ms (0 allocations: 0 bytes)
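As a back-of-the-envelope check against the 546 GB/s figure from the first post: a matrix-vector product at this size is essentially bound by streaming the ~2.1 GB matrix once, so these timings correspond to roughly 205 GB/s (column major) and 260 GB/s (row major). A minimal sketch of that arithmetic, using the median times printed above:

N = 16384
bytes = N^2 * sizeof(Float64)   # ≈ 2.15 GB for the matrix; the vectors are negligible
for (label, t) in (("column major", 10.45e-3), ("row major", 8.21e-3))
    println(label, ": ", round(bytes / t / 1e9; digits = 1), " GB/s")
end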
I don’t think there are more than 2 AMX engines on the M4 Max. In my testing, by adjusting the VECLIB_MAXIMUM_THREADS variable I could only get it to use 1 or 2 AMX engines. This was corroborated by Activity Monitor, which showed either 1 or 2 P-cores fully loaded, respectively. I asked a friend to run the same test on an M3 Ultra, and there we observed 4 AMX engines in use, which is what we expected.
Did you try launching matmuls on multiple processes concurrently? That’s how I concluded that my M4 Pro has 3, and I’d be surprised if the M4 Max has fewer. But a single matmul definitely only uses 2, so you have to launch multiple at the same time from different processes (maybe different threads within the same process works too, I haven’t tried that).
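If anyone wants to try the threads-within-one-process variant, here’s a rough, untested sketch (assuming Julia is started with -t 3 and, to mirror the process experiment, VECLIB_MAXIMUM_THREADS=1 set before launch; whether Accelerate actually spreads these calls across separate AMX engines is exactly the open question):

using AppleAccelerate, LinearAlgebra

# Three independent matmuls launched concurrently from one process
N = 8192
As = [randn(N, N) for _ in 1:3]
Bs = [randn(N, N) for _ in 1:3]
Cs = [similar(A) for A in As]

t = @elapsed begin
    tasks = [Threads.@spawn mul!(Cs[i], As[i], Bs[i]) for i in 1:3]
    foreach(wait, tasks)
end
println("aggregate GFLOP/s: ", 3 * 2 * N^3 / t / 1e9)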
Here’s what I’m doing from the fish shell (I guess bash or zsh has slightly different loop syntax, but you get the idea):
for i in (seq 1 3)
VECLIB_MAXIMUM_THREADS=1 julia --startup-file=no -e "using AppleAccelerate, InteractiveUtils; println(peakflops(2^13))" &
end
For 1, 2, and 3 concurrent processes, the printed numbers are similar: in the neighborhood of 3.9e11. When I increase to 4 the numbers start to drop.
EDIT: Hm, but there’s a substantial delay between the printing of the second and third numbers, so the above benchmark must be flawed. I suppose that to get proper concurrent benchmarks you have to restrict each process to a single trial; otherwise one process’s last trial may run after the others have returned. Taking this into account, you can clearly see the numbers drop when the third process is added:
❯ for i in (seq 1 2)
VECLIB_MAXIMUM_THREADS=1 julia --startup-file=no -e "using AppleAccelerate, InteractiveUtils; println(peakflops(2^13; ntrials=1))" &
end
3.545642116407109e11
3.5424043606968884e11
❯ for i in (seq 1 3)
VECLIB_MAXIMUM_THREADS=1 julia --startup-file=no -e "using AppleAccelerate, InteractiveUtils; println(peakflops(2^13; ntrials=1))" &
end
3.574504262199497e11
3.557026872100763e11
2.0444442089157233e11
So I stand corrected. The M4 Pro seems to have two AMX engines.