No, it was the 11900K that Gamers Nexus called “a waste of sand”.
The 10900K was a 10 core Skylake-derived (Comet Lake) chip.
The 11900K was an 8 core Rocket Lake.
I’m a fan of AVX512, so I’m inclined to disagree and would take the 8 core Rocket Lake over the 10 core Skylake.
However, Rocket Lake has only a single 512 bit FMA unit (actually, it’s two 256 bit units working together). This was also the case for certain Skylake-X chips, Ice Lake client, Tiger Lake, and Alder Lake before AVX512 was disabled altogether.
As such, you probably aren’t going to get better GEMM performance: running AVX2, you use ports 0 and 1 to each do one 256 bit FMA per cycle, for 2x256 bits of FMA/cycle.
With AVX512, they work together to do 1x512 bits of FMA/cycle.
Most Skylake-X and Ice Lake server have 2x512 bits. These are the chips that do well on matrix multiply.
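To make the arithmetic above concrete, here is a minimal sketch of the per-core, per-cycle peak for each layout. It assumes double precision (64 bit elements) and counts an FMA as 2 floating point operations; the function name `flops_per_cycle` is just something I made up for illustration.

```python
def flops_per_cycle(units, vector_bits, element_bits=64):
    """Peak FLOP per cycle per core, assuming every FMA unit issues each cycle."""
    lanes = vector_bits // element_bits  # elements per vector register
    return units * lanes * 2             # one FMA = multiply + add = 2 FLOP


# (number of FMA units, vector width in bits) for the layouts discussed above
configs = {
    "AVX2, 2x256 bit (ports 0 and 1)": (2, 256),
    "AVX512, 1x512 bit (ports fused)": (1, 512),
    "AVX512, 2x512 bit": (2, 512),
    "M1, 4x128 bit": (4, 128),
}

for name, (units, bits) in configs.items():
    peak = flops_per_cycle(units, bits)
    print(f"{name}: {peak} FLOP/cycle, needs {units} FMA instruction(s) per cycle")
```

The first, second, and fourth layouts have the same peak; what differs is how many instructions have to be decoded and scheduled every cycle to reach it.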
Even on chips with just one 512 bit FMA unit, it is easier to achieve peak performance with AVX512, however.
Execution is just one part of the pipeline.
Decoding one instruction is faster than decoding two, scheduling one is easier than scheduling two, etc…
However, GEMM is optimized well enough that it tends to be bottlenecked by execution.
There is some discussion here, where I found that my Tiger Lake CPU reached >99% of the theoretical peak performance, while IIRC the M1 was in the low 90% area.
The M1 has the same execution capability, but in the form of 4x 128 bit = 512. It’s just much harder to decode and schedule 4 instructions than 1.
While the clock for clock theoretical peak performance of the 11900K and 9700K is equal, it should be easier to get close to the former’s peak. Aside from it being easier to schedule fewer instructions, the 11900K has better out of order capabilities, can fit 4x the data into named registers, has a 50% larger L1 data cache, and a 100% larger L2 cache.
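Those ratios can be checked with a quick sketch. The per-core numbers below are what I’d assume for the two chips: 16 256 bit registers, 32 KiB L1d, and 256 KiB L2 for the 9700K; 32 512 bit registers, 48 KiB L1d, and 512 KiB L2 for the 11900K. The `Core` class is only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Core:
    vector_registers: int  # architectural (named) vector registers
    register_bits: int     # width of each vector register
    l1d_kib: int           # L1 data cache per core, KiB
    l2_kib: int            # L2 cache per core, KiB

    def register_bytes(self):
        return self.vector_registers * self.register_bits // 8


# Assumed per-core figures, not read from the hardware
i9700k = Core(vector_registers=16, register_bits=256, l1d_kib=32, l2_kib=256)
i11900k = Core(vector_registers=32, register_bits=512, l1d_kib=48, l2_kib=512)

print("named registers:", i11900k.register_bytes() / i9700k.register_bytes(), "x")
print("L1d:", i11900k.l1d_kib / i9700k.l1d_kib, "x")
print("L2:", i11900k.l2_kib / i9700k.l2_kib, "x")
```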
Still, MKL is very good at getting close to 100% peak at large sizes, so it being more difficult to do so for the 9700K doesn’t really imply that MKL isn’t getting close enough anyway.
So clock speed is probably more important for GEMM. I’d have thought the 11900K’s clock speed is higher. Maybe the 9700K is overclocked and the 11900K isn’t? Or perhaps there are differences in cooling/thermal throttling?
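As a rough sketch of why clock speed dominates: with 8 cores and 16 double precision FLOP/cycle per core on both chips, the theoretical peak scales directly with the sustained all-core clock. The clock values below are made-up placeholders, not measurements from either machine, and `peak_gflops` is just an illustrative helper.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: cores x sustained clock (GHz) x FLOP/cycle."""
    return cores * clock_ghz * flops_per_cycle


# Hypothetical sustained all-core clocks under a heavy FMA load;
# substitute whatever each machine actually holds during the benchmark.
assumed_clocks_ghz = {"9700K": 4.9, "11900K": 4.6}

for chip, ghz in assumed_clocks_ghz.items():
    peak = peak_gflops(cores=8, clock_ghz=ghz, flops_per_cycle=16)
    print(f"{chip} at {ghz} GHz: {peak:.1f} GFLOPS peak")
```

Dividing a measured GFLOPS number by this peak gives the fraction of peak actually reached, which is the fair way to compare the two if their clocks differ.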
Could also be that MKL is badly tuned for Rocket Lake. Maybe it treats Rocket Lake like Skylake-X, even though Skylake-X has twice the L2 cache of Rocket Lake (and 4x that of Skylake), and can thus use much larger L2 blocks. Unlikely, but things like it are possible.
How does Octavian.jl compare on both?
It reads hardware data to generate code specific to the CPU, but that code isn’t as well optimized as MKL’s, so it’s likely to have an easier time being fast on the 11900K.