Hello everybody!
I wanted to share my linear algebra performance findings in Julia on the M4 Max and M3 Ultra in double-precision floating point. As a Julia novice, I was hugely impressed by how easy it was to brainstorm ideas in Julia and run quick performance tests; when I subsequently sanity-checked them in C, there was no significant difference in performance. I was equally impressed by how easy it was to move from CPU to GPU execution. Thanks to everybody who contributed to Metal.jl and made this possible! In fact, I found GPU matrix multiplication in Julia to be significantly faster than Apple’s MLX.
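A minimal sketch of the kind of CPU-vs-GPU comparison this involves (not my exact script; note that Apple GPUs have no hardware Float64, so the GPU side runs in single precision):

```julia
using Metal, LinearAlgebra, BenchmarkTools

# Apple GPUs lack native Float64, so this comparison uses Float32.
N = 8192
A, B = rand(Float32, N, N), rand(Float32, N, N)

# CPU baseline via the BLAS that ships with Julia
t_cpu = @belapsed $A * $B

# GPU: copy the data over once, then time the multiply;
# Metal.@sync makes sure the kernel has finished before the timer stops.
dA, dB = MtlArray(A), MtlArray(B)
t_gpu = @belapsed Metal.@sync($dA * $dB)

gflops(t) = 2 * N^3 / t / 1e9
println("CPU: $(round(Int, gflops(t_cpu))) GFLOPS, GPU: $(round(Int, gflops(t_gpu))) GFLOPS")
```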
One surprising finding was that the M4 Max actually outperformed the M3 Ultra in matrix-vector multiplication, which is a memory-bound computation. My STREAM benchmark runs confirmed that the two chips are indeed very close in CPU memory bandwidth (with a slight edge to the M3 Ultra). It is only when you move the computation to the GPU that the bandwidth advantage of the M3 Ultra really shows.
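A sketch of how the matrix-vector numbers (and the implied bandwidth) can be measured; the traffic estimate assumes the matrix is streamed from memory exactly once:

```julia
using BenchmarkTools, LinearAlgebra

N = 8192
A, x = rand(Float64, N, N), rand(Float64, N)
y = similar(x)

# In-place y = A*x, so nothing is allocated inside the timed loop.
t = @belapsed mul!($y, $A, $x)

flops = 2 * N^2             # one multiply + one add per matrix element
bytes = 8 * (N^2 + 2 * N)   # A read once, x read, y written; 8 bytes per Float64
println("$(flops / t / 1e9) GFLOPS, $(bytes / t / 1e9) GB/s")
```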
I documented my findings in this video, where I use the roofline model to visualize performance expectations versus the Ryzen 9950X: https://youtu.be/dwYaFlnrFgA
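For context, the roofline model caps attainable performance at the lower of the compute peak and the memory bandwidth times the arithmetic intensity of the kernel; for double-precision DGEMV the intensity works out to about 0.25 FLOP/byte:

$$
P_\text{attainable} = \min\left(P_\text{peak},\ I \cdot BW\right), \qquad
I_\text{DGEMV} = \frac{2N^2\ \text{FLOPs}}{8(N^2 + 2N)\ \text{bytes}} \approx 0.25\ \text{FLOP/byte}
$$

So 0.25 FLOP/byte × 275 GB/s ≈ 69 GFLOPS, which lines up well with the M4 Max DGEMV result below.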
Benchmark results (multi-threaded, P-cores only)
DGEMM, N = 8192
| CPU | Benchmarked in Julia | Theoretical Peak |
|---|---|---|
| M1 4P | 181 GFLOPS | 205 GFLOPS |
| M4 Max 12P | 698 GFLOPS | 747 GFLOPS |
| M3 Ultra 20P | 1073 GFLOPS | 1145 GFLOPS |
| Ryzen 9950X | 1763 GFLOPS | 1946 GFLOPS |
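The Theoretical Peak column can be sanity-checked with the usual formula:

$$
P_\text{peak} = \text{cores} \times \frac{\text{FLOPs}}{\text{cycle} \cdot \text{core}} \times f_\text{clock}
$$

Assuming the commonly reported four 128-bit FMA pipes per Apple P-core (2 Float64 lanes × 2 FLOPs per FMA × 4 pipes = 16 FLOP/cycle), the M1 works out to 4 cores × 16 FLOP/cycle × 3.2 GHz ≈ 205 GFLOPS, matching the table.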
DGEMV, N = 8192

| CPU | Benchmarked in Julia | Achieved Memory BW |
|---|---|---|
| M1 4P | 14.5 GFLOPS | 58 GB/s |
| M4 Max 12P | 72 GFLOPS | 275 GB/s |
| M3 Ultra 20P | 65 GFLOPS | 261 GB/s |
| Ryzen 9950X | 15 GFLOPS | 60 GB/s |
| 10980XE | 19 GFLOPS | 76 GB/s |