Apple M4 Max AMX Linear Algebra performance versus CPU and GPU

Hello everybody!

I made a video investigation of Apple’s AMX accelerator Linear Algebra performance in Julia: https://www.youtube.com/watch?v=TjfA9LVgHXk

According to my findings, the 2 AMX cores achieve almost 3 times the peak FP32 performance of the 12 P-cores, and close to the same performance as the Ryzen 9950X, in dense matrix-matrix multiplication. At the same time, they are about 10 times more power efficient than the P-cores and about 6 times more power efficient than the Zen 5 cores.
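For anyone who wants to try a quick version of this at home: the video's full benchmark isn't reproduced here, but a minimal stdlib-only sketch of timing a dense FP32 matrix multiply in Julia could look like the following (the size `N` and the single-timing approach are my own choices, not the video's; BenchmarkTools.jl would give more robust numbers):

```julia
using LinearAlgebra

N = 2048
A = rand(Float32, N, N)
B = rand(Float32, N, N)
C = similar(A)

mul!(C, A, B)                 # warm up: force compilation before timing
t = @elapsed mul!(C, A, B)    # seconds for one in-place GEMM

gflops = 2N^3 / t / 1e9       # dense GEMM performs ~2N^3 flops
println("FP32 GEMM: $(round(gflops, digits = 1)) GFLOPS")
```

On Apple Silicon, which BLAS backend Julia dispatches to (OpenBLAS by default, or Accelerate via a package such as AppleAccelerate.jl) determines whether the AMX units are actually exercised.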

Somewhat disappointingly, the AMX cores have the same throughput in FP16 as in FP32, even though Apple's patents made it seem reasonable to expect a 4-fold increase.

In FP64, the AMX core throughput drops to 1/4 that of FP32, as expected. However, this is still superior to the 12 P-core performance.

The other problem I examined was matrix-vector multiplication; there, the performance was memory-bound and roughly matched that of the P-cores.
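The memory-bound behavior is easy to see from the arithmetic intensity: a GEMV reads each of the N² matrix elements exactly once but performs only ~2N² flops, so bandwidth dominates. A rough sketch of estimating the effective bandwidth (again a single-timing approximation of my own, not the video's methodology):

```julia
using LinearAlgebra

N = 8192
A = rand(Float32, N, N)
x = rand(Float32, N)
y = similar(x)

mul!(y, A, x)                 # warm up: force compilation before timing
t = @elapsed mul!(y, A, x)    # seconds for one in-place GEMV

# The N^2 matrix elements dominate the traffic; x and y are negligible.
gbps = sizeof(Float32) * N^2 / t / 1e9
println("Effective bandwidth: $(round(gbps, digits = 1)) GB/s")
```

Comparing that number against the machine's advertised memory bandwidth shows how close GEMV gets to the memory-bound ceiling.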

When comparing against the GPU, it becomes quite apparent why Apple introduced the AMX: it uplifts performance exactly where the GPU is weak (small problem sizes), and it does so with significantly higher power efficiency.

Curiously, I achieved better peak GPU performance in Julia than using Apple’s MLX! Kudos to everyone who made it possible to so easily leverage the GPU in Julia.


Very nice!

Is the code available?

Great video! Just bumped into your video on YouTube and then saw the post here.