A Brief Study of Memory-Bound Application Performance on the M3 Ultra and M4 Max in Julia

Hello everybody!

I wanted to share my linear algebra performance findings in Julia on the M4 Max and M3 Ultra in double-precision floating point. As a Julia novice, I was hugely impressed by how easy it was to brainstorm ideas in Julia and run quick performance tests; I subsequently sanity-checked them in C and found no significant difference in performance. I was equally impressed by how easy it was to move from CPU to GPU execution. Thanks to everybody who contributed to Metal.jl and made this possible! In fact, I found GPU matrix multiplication in Julia to be significantly faster than Apple's MLX.
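
As a rough illustration, a Metal.jl matrix-multiply benchmark can be as short as this (a sketch, not my exact harness; since Apple GPUs have no native Float64 support, the example uses Float32):

```julia
using Metal, BenchmarkTools, LinearAlgebra

N = 8192
A = MtlArray(rand(Float32, N, N))   # Apple GPUs: no native Float64
B = MtlArray(rand(Float32, N, N))
C = similar(A)

# Metal.@sync blocks until the GPU finishes, so the timing is meaningful
t = @belapsed Metal.@sync mul!($C, $A, $B)
println("GPU GEMM: ", round(2 * N^3 / t / 1e9, digits = 1), " GFLOPS")
```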

One surprising finding was that the M4 Max actually outperformed the M3 Ultra in matrix-vector multiplication, a memory-bound computation. My STREAM benchmark results indicated that the two chips are indeed very close in CPU memory bandwidth, with a slight edge to the M3 Ultra. Only when the computation moves to the GPU does the M3 Ultra's bandwidth advantage truly show.
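
For reference, a STREAM-style triad kernel takes only a few lines of Julia (a sketch; the array size and threading here are illustrative choices, not my exact setup):

```julia
using Base.Threads

# STREAM triad: a[i] = b[i] + s * c[i], streaming three arrays through memory
function triad!(a, b, c, s)
    @threads for i in eachindex(a, b, c)
        @inbounds a[i] = b[i] + s * c[i]
    end
end

n = 2^27                       # 1 GiB per Float64 array, far beyond any cache
a, b, c = (zeros(n) for _ in 1:3)
triad!(a, b, c, 2.0)           # warm-up run to trigger compilation
t = @elapsed triad!(a, b, c, 2.0)
println("Triad: ", round(3 * 8 * n / t / 1e9, digits = 1), " GB/s")
```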

I documented my findings in this video, where I used the roofline model to visualize performance expectations versus the Ryzen 9950X: https://youtu.be/dwYaFlnrFgA
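
The roofline bound itself is one line of arithmetic: attainable FLOPS = min(peak FLOPS, arithmetic intensity × bandwidth). DGEMV performs 2N² FLOPs while streaming roughly 8N² bytes of matrix data, an intensity of about 0.25 FLOPs/byte, which puts all of these chips on the bandwidth-limited slope of the roof. A quick sanity check against the M4 Max numbers below:

```julia
# Roofline model: performance is capped by compute or by memory traffic
roofline(peak_gflops, intensity, bw_gbs) = min(peak_gflops, intensity * bw_gbs)

# DGEMV: 2 FLOPs per 8-byte matrix element ≈ 0.25 FLOPs/byte
intensity_dgemv = 0.25

# M4 Max: 747 GFLOPS peak, ~275 GB/s achieved bandwidth (see tables below)
roofline(747.0, intensity_dgemv, 275.0)   # ≈ 69 GFLOPS, near the 72 measured
```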

**Benchmark results (multi-threaded, P-cores only)**

DGEMM, N = 8192

| CPU | Benchmarked in Julia | Theoretical Peak |
| --- | --- | --- |
| M1 (4P) | 181 GFLOPS | 205 GFLOPS |
| M4 Max (12P) | 698 GFLOPS | 747 GFLOPS |
| M3 Ultra (20P) | 1073 GFLOPS | 1145 GFLOPS |
| Ryzen 9950X | 1763 GFLOPS | 1946 GFLOPS |
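
A minimal version of a multi-threaded DGEMM timing in Julia looks roughly like this (a sketch using the built-in BLAS interface, not necessarily my exact script):

```julia
using LinearAlgebra, BenchmarkTools

BLAS.set_num_threads(12)          # e.g. match the P-core count
N = 8192
A, B = rand(N, N), rand(N, N)     # Float64 by default
C = similar(A)

t = @belapsed mul!($C, $A, $B)    # in-place DGEMM via BLAS
println("DGEMM: ", round(2 * N^3 / t / 1e9, digits = 1), " GFLOPS")
```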

DGEMV (P-cores only), N = 8192

| CPU | Benchmarked in Julia | Achieved Memory BW |
| --- | --- | --- |
| M1 (4P) | 14.5 GFLOPS | 58 GB/s |
| M4 Max (12P) | 72 GFLOPS | 275 GB/s |
| M3 Ultra (20P) | 65 GFLOPS | 261 GB/s |
| Ryzen 9950X | 15 GFLOPS | 60 GB/s |
| 10980XE | 19 GFLOPS | 76 GB/s |
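
For reference, GFLOPS and achieved bandwidth both fall out of a single DGEMV timing, since the kernel touches the N×N matrix once plus the two vectors. A sketch of such a measurement (not necessarily my exact script):

```julia
using LinearAlgebra, BenchmarkTools

N = 8192
A, x = rand(N, N), rand(N)
y = similar(x)

t = @belapsed mul!($y, $A, $x)        # in-place DGEMV
gflops = 2 * N^2 / t / 1e9
bw = 8 * (N^2 + 2 * N) / t / 1e9      # matrix plus both vectors, once each
println("DGEMV: $(round(gflops, digits = 1)) GFLOPS, $(round(bw, digits = 1)) GB/s")
```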

Are your tests single-threaded or multi-threaded? Could you summarize your matrix-matrix multiplication results? Thanks.


I updated my original post with a summary of the findings.

All tests were multi-threaded and executed on P-cores only.
