Apple M1 was an impressive chip for L1 cache (and those numbers matter, not just the larger MB-sized numbers people like to promote for L2 and up):

- Performance cores: 192+128 KiB per core
- Efficiency cores: 128+64 KiB per core
These numbers are unchanged for M3, and at least for M1 the former number is the instruction cache (Icache) and the latter the data cache (Dcache); I assume the same holds for M3. Both Icache and Dcache matter, and these are the largest I've ever seen. I don't have data for M4.
But looking at the table for AMD's (world-class) Zen 1 to 5 microarchitectures, I see the Icache got smaller: it halved to 32 KB in Zen 2 and has stayed there through Zen 5.
All else being equal that seems worse, and if there's a good reason for 192 KB in Apple's chips, why only 32 KB in AMD's?
I think the reason could be the “μop” cache, and to be fair its size has consistently been going up (though data on it is missing for Zen 5).
So on Arm, where each instruction is 4 bytes, the L1 Icache presumably holds 192*1024/4 = 49152 instructions?
But for AMD only "6.75K"?! Note that μops (micro-ops) means CISC x86 instructions are broken into one or more micro-ops (so potentially even fewer x86 instructions are stored), and those are stored in the μop cache, which I believe is separate from the L1 cache (if not, how do they fit more and more into the same 32 KB?). [The opposite is also done: "they combine certain machine instruction sequences (such as a compare followed by a conditional jump) into a more complex μop which fits the execution model better". Exactly how μops are defined (and whether at most 2 instructions get merged?) is likely a closely guarded secret and subject to change; each μop may be getting smaller to fit more into the same number of KB.]
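To make that back-of-the-envelope comparison concrete, here's a minimal sketch in C using the numbers above. The ~4-byte average x86 instruction length is my assumption (x86 instructions actually range from 1 to 15 bytes), and I'm reading "6.75K" as μop-cache entries rather than bytes:

```c
/* Back-of-the-envelope: roughly how many instructions fit in each structure.
   Assumptions (mine, not from the table): ~4 bytes per x86 instruction on
   average, and "6.75K" meaning 6.75K uop-cache entries, i.e. at most about
   that many x86 instructions (fewer when an instruction splits into several
   uops, possibly more with fusion). */
#include <stdio.h>

int main(void) {
    const double apple_l1i = 192 * 1024;  /* M1/M3 P-core L1 Icache, bytes */
    const double amd_l1i   = 32 * 1024;   /* Zen 2..5 L1 Icache, bytes     */
    const double arm_insn  = 4;           /* fixed-width AArch64, bytes    */
    const double x86_insn  = 4;           /* assumed average, bytes        */
    const double amd_uops  = 6.75 * 1024; /* Zen 4 uop cache, entries      */

    printf("Apple L1I:     ~%.0f instructions\n", apple_l1i / arm_insn);
    printf("AMD   L1I:     ~%.0f instructions (at ~4 B/insn)\n",
           amd_l1i / x86_insn);
    printf("AMD uop cache: ~%.0f uops\n", amd_uops);
    return 0;
}
```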
The L1 KB numbers for Apple's Arm chips could, for all I know, actually be split between a regular L1 and a μop cache, maybe evenly, maybe not, which would still leave room for about 20+K instructions? I believe at least some Arm/RISC chips have micro-ops, since RISC chips are less RISC than they used to be…
[Intel had a trace cache in the Pentium 4, and I believe no current chips have one: "Intel later introduced a similar but simpler concept with Sandy Bridge called micro-operation cache (UOP cache)."]
Also, the Dcache is only up to 48 KB for AMD, vs. 128 KiB for Apple, and at least that's an apples-to-apples (no pun intended) comparison.
EDIT: I see now that AMD's L2 cache is massive and per core, growing to 1 MB, compensating for the small L1 I- and D-caches, though of course at higher latency than L1. Apple's L2 isn't per core, and Apple has no L3, unlike AMD.
I think this reflects that AMD is going for very many cores (focusing more on data centers, with e.g. desktops and games more of an afterthought, unless programmers are forced to eliminate single-threading, or already have), and then per-core cache is expensive: you'd rather have it smaller and have more cores. Sacrificing single-threaded performance (and the performance of any one core even in multithreaded code, until you multiply by the number of cores used at once)?!
So, in short, are Apple's chips still the best around, at least for single-threaded work? Any good competitors?
Another thing I find interesting is that the latency of AMD's L1 cache is listed as "4–8" [cycles]; I would have thought the ideal was a single cycle. You need to issue new instructions every cycle, so this might be explained by pipelining, and/or by the μop cache being the part that really matters. And what is its latency, does anyone know?
The Dcache, for AMD, also has "4–8"-cycle latency. All instructions need data, usually from registers rather than from cache or memory, though often not just from registers. I also thought the Dcache was single-cycle… at least it no longer is? Or can the CPU prefetch from the Dcache for upcoming instructions?
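Out of curiosity, here's a minimal pointer-chasing sketch in C of how one could measure this, under the assumptions that a ~16 KiB working set stays resident in L1d and that the compiler doesn't optimize away the dependent load chain. It reports nanoseconds per load; multiply by the clock frequency in GHz to get a rough cycle count:

```c
/* Minimal pointer-chasing sketch to estimate L1d load-to-use latency.
   Assumption: ~16 KiB working set fits comfortably in the L1 data cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 / sizeof(void *))  /* working set of ~16 KiB */
#define ITERS (100UL * 1000 * 1000)

int main(void) {
    static void *chain[N];
    size_t idx[N];

    /* Build a random cyclic permutation so the hardware prefetcher can't
       hide the latency of each load behind the previous one. */
    for (size_t i = 0; i < N; i++) idx[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        chain[idx[i]] = &chain[idx[(i + 1) % N]];

    /* Chase the chain: each load depends on the previous one, so the average
       time per iteration approximates the load-to-use latency. */
    void **p = &chain[idx[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERS; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("end %p: ~%.2f ns per dependent load (times clock GHz = cycles)\n",
           (void *)p, ns / ITERS);
    return 0;
}
```

On a given machine this only gives the latency of dependent loads; independent loads can of course be pipelined, which is presumably part of why a 4-plus-cycle L1 doesn't stall the core every access.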