Hello everybody!
I made a video investigating the linear algebra performance of Apple’s AMX accelerator in Julia: https://www.youtube.com/watch?v=TjfA9LVgHXk
According to my findings, the 2 AMX cores achieve almost 3 times the peak FP32 performance of the 12 P-cores and close to the same performance as the Ryzen 9950X in dense matrix-matrix multiplication. At the same time, they are 10 times more power efficient than the P-cores and about 6 times more power efficient than the Zen 5 cores.
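For anyone who wants to reproduce the dense GEMM numbers, here is a minimal sketch of how such a measurement can be set up in Julia. It is not the exact script from the video; it assumes AppleAccelerate.jl (which forwards Julia’s BLAS calls to Apple’s Accelerate framework, the library that runs GEMM on the AMX units) and BenchmarkTools.jl, and the problem size is my own choice:

```julia
using AppleAccelerate, LinearAlgebra, BenchmarkTools

n = 4096                          # problem size is an assumption; sweep it
A = rand(Float32, n, n)
B = rand(Float32, n, n)
C = similar(A)

t = @belapsed mul!($C, $A, $B)    # in-place GEMM avoids allocation noise
gflops = 2n^3 / t / 1e9           # an n×n GEMM performs ~2n³ flops
println("FP32 GEMM: $(round(gflops, digits = 1)) GFLOP/s")
```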
Somewhat disappointingly, the AMX cores have the same throughput in FP16 as in FP32, even though Apple’s patents made it reasonable to expect a 4-fold increase.
In FP64, the AMX core throughput drops to 1/4 that of FP32, as expected. However, this is still superior to the 12 P-core performance.
The other problem I examined was matrix-vector multiplication; there the performance was memory-bound and roughly matched that of the P-cores.
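For the memory-bound GEMV case, the quantity of interest is effective bandwidth rather than FLOPs, since the matrix has to be streamed from memory once per multiply. A hedged sketch under the same assumptions as above (sizes and packages are mine, not necessarily what the video used):

```julia
using AppleAccelerate, LinearAlgebra, BenchmarkTools

n = 8192
A = rand(Float32, n, n)
x = rand(Float32, n)
y = similar(x)

t = @belapsed mul!($y, $A, $x)    # GEMV through Accelerate's BLAS
gbps = sizeof(A) / t / 1e9        # ≈ bytes of A streamed per second
println("GEMV effective bandwidth: $(round(gbps, digits = 1)) GB/s")
```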
When comparing against the GPU, it becomes quite apparent why Apple introduced the AMX: it boosts performance exactly where the GPU is weak, namely at small problem sizes, and it does so with significantly higher power efficiency.
Curiously, I achieved better peak GPU performance in Julia than with Apple’s MLX! Kudos to everyone who made it so easy to leverage the GPU from Julia.
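For the curious, a minimal sketch of the GPU-side measurement using Metal.jl, where `mul!` on `MtlArray`s dispatches to Metal Performance Shaders. Again, this illustrates the approach rather than reproducing the exact benchmark from the video:

```julia
using Metal, LinearAlgebra

n = 4096
dA = MtlArray(rand(Float32, n, n))
dB = MtlArray(rand(Float32, n, n))
dC = similar(dA)

Metal.@sync mul!(dC, dA, dB)      # warm-up: compilation and first launch
t = minimum(@elapsed(Metal.@sync mul!(dC, dA, dB)) for _ in 1:10)
gflops = 2n^3 / t / 1e9
println("GPU FP32 GEMM: $(round(gflops, digits = 1)) GFLOP/s")
```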