A Brief Study of Memory-Bound Application Performance on the M3 Ultra and M4 Max in Julia

Hello everybody!

I wanted to share my linear algebra performance findings in Julia on the M4 Max and M3 Ultra in double-precision floating point. As a Julia novice, I was hugely impressed by how easy it was to brainstorm ideas in Julia and run quick performance tests; I subsequently sanity-checked them in C and found no significant difference in performance. I was equally impressed by how easy it was to move from CPU to GPU execution. Thanks to everybody who contributed to Metal.jl and made this possible! In fact, I found GPU matrix multiplication in Julia to be significantly faster than Apple's MLX.
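
As a rough illustration, a Metal.jl matrix-multiply benchmark can be as short as this (a sketch, not my exact harness; since Apple GPUs have no native Float64 support, the example uses Float32):

```julia
using Metal, BenchmarkTools, LinearAlgebra

N = 8192
A = MtlArray(rand(Float32, N, N))   # Apple GPUs: no native Float64
B = MtlArray(rand(Float32, N, N))
C = similar(A)

# Metal.@sync blocks until the GPU finishes, so the timing is meaningful
t = @belapsed Metal.@sync mul!($C, $A, $B)
println("GPU GEMM: ", round(2 * N^3 / t / 1e9, digits = 1), " GFLOPS")
```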

One surprising finding was that the M4 Max actually outperformed the M3 Ultra in matrix-vector multiplication, a memory-bound computation. My STREAM benchmark results indicated that the two chips are indeed very close in CPU memory bandwidth, with a slight edge to the M3 Ultra. Only when the computation moves to the GPU does the M3 Ultra's bandwidth advantage truly show.
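
For reference, a STREAM-style triad kernel takes only a few lines of Julia (a sketch; the array size and threading here are illustrative choices, not my exact setup):

```julia
using Base.Threads

# STREAM triad: a[i] = b[i] + s * c[i], streaming three arrays through memory
function triad!(a, b, c, s)
    @threads for i in eachindex(a, b, c)
        @inbounds a[i] = b[i] + s * c[i]
    end
end

n = 2^27                       # 1 GiB per Float64 array, far beyond any cache
a, b, c = (zeros(n) for _ in 1:3)
triad!(a, b, c, 2.0)           # warm-up run to trigger compilation
t = @elapsed triad!(a, b, c, 2.0)
println("Triad: ", round(3 * 8 * n / t / 1e9, digits = 1), " GB/s")
```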

I documented my findings in this video, where I used the roofline model to visualize performance expectations versus the Ryzen 9950X: https://youtu.be/dwYaFlnrFgA
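
The roofline bound itself is one line of arithmetic: attainable FLOPS = min(peak FLOPS, arithmetic intensity × bandwidth). DGEMV performs 2N² FLOPs while streaming roughly 8N² bytes of matrix data, an intensity of about 0.25 FLOPs/byte, which puts all of these chips on the bandwidth-limited slope of the roof. A quick sanity check against the M4 Max numbers below:

```julia
# Roofline model: performance is capped by compute or by memory traffic
roofline(peak_gflops, intensity, bw_gbs) = min(peak_gflops, intensity * bw_gbs)

# DGEMV: 2 FLOPs per 8-byte matrix element ≈ 0.25 FLOPs/byte
intensity_dgemv = 0.25

# M4 Max: 747 GFLOPS peak, ~275 GB/s achieved bandwidth (see tables below)
roofline(747.0, intensity_dgemv, 275.0)   # ≈ 69 GFLOPS, near the 72 measured
```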

**Benchmark results (multi-threaded, P-cores only)**

DGEMM, N = 8192

| CPU | Benchmarked in Julia | Theoretical Peak |
| --- | --- | --- |
| M1 (4P) | 181 GFLOPS | 205 GFLOPS |
| M4 Max (12P) | 698 GFLOPS | 747 GFLOPS |
| M3 Ultra (20P) | 1073 GFLOPS | 1145 GFLOPS |
| Ryzen 9950X | 1763 GFLOPS | 1946 GFLOPS |
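
A minimal version of a multi-threaded DGEMM timing in Julia looks roughly like this (a sketch using the built-in BLAS interface, not necessarily my exact script):

```julia
using LinearAlgebra, BenchmarkTools

BLAS.set_num_threads(12)          # e.g. match the P-core count
N = 8192
A, B = rand(N, N), rand(N, N)     # Float64 by default
C = similar(A)

t = @belapsed mul!($C, $A, $B)    # in-place DGEMM via BLAS
println("DGEMM: ", round(2 * N^3 / t / 1e9, digits = 1), " GFLOPS")
```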

DGEMV (P-cores only), N = 8192

| CPU | Benchmarked in Julia | Achieved Memory BW |
| --- | --- | --- |
| M1 (4P) | 14.5 GFLOPS | 58 GB/s |
| M4 Max (12P) | 72 GFLOPS | 275 GB/s |
| M3 Ultra (20P) | 65 GFLOPS | 261 GB/s |
| Ryzen 9950X | 15 GFLOPS | 60 GB/s |
| 10980XE | 19 GFLOPS | 76 GB/s |
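
For reference, GFLOPS and achieved bandwidth both fall out of a single DGEMV timing, since the kernel touches the N×N matrix once plus the two vectors. A sketch of such a measurement (not necessarily my exact script):

```julia
using LinearAlgebra, BenchmarkTools

N = 8192
A, x = rand(N, N), rand(N)
y = similar(x)

t = @belapsed mul!($y, $A, $x)        # in-place DGEMV
gflops = 2 * N^2 / t / 1e9
bw = 8 * (N^2 + 2 * N) / t / 1e9      # matrix plus both vectors, once each
println("DGEMV: $(round(gflops, digits = 1)) GFLOPS, $(round(bw, digits = 1)) GB/s")
```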

Are your tests single-threaded or multi-threaded? Could you summarize your matrix-matrix multiplication results? Thanks.


I updated my original post with a summary of the findings.

All tests were multi-threaded and executed on P-cores only.
