Are cache sizes getting smaller for AMD (some, i.e. L1, though not the per-core L2 cache), unlike for Apple, because of multi-threading?

Apple M1 was an impressive chip for L1 cache (and those numbers matter, not just the larger MB-sized numbers people like to promote for L2 and above):

Performance cores: 192+128 KiB per core
Efficiency cores: 128+64 KiB per core

These numbers are unchanged for M3, and at least for M1, the former numbers are for the instruction cache (Icache) and the latter for the data cache (Dcache); I assume the same for M3. Both Icache and Dcache are important, and those are the largest I’ve ever seen. I don’t have data for M4.

But looking at the table for AMD’s (world-class) Zen microarchitectures, Zen 1 through 5, I see the Icache is getting smaller, halving to 32 KiB from Zen 2 through Zen 5:

All else being equal that seems worse, and if there’s a good reason for 192 KiB in Apple’s chips, why only 32 KiB in AMD’s?

I think the reason could be the “μop” cache, and to be fair its size has been going up consistently (though data on it is missing for Zen 5).

So on ARM, each instruction is 4 bytes, presumably holding 192*1024/4 = 49152 instructions in the L1 Icache?

But for AMD, only “6.75K”?! Note that μop means micro-op: CISC x86 instructions are broken into one or more micro-ops (so potentially even fewer x86 instructions are stored), and those are stored in the μop cache (which is, I believe, separate from the L1 cache; if not, how do they fit more and more into the same 32 KiB?). [The opposite is also done: “they combine certain machine instruction sequences (such as a compare followed by a conditional jump) into a more complex μop which fits the execution model better”. Exactly how μops are defined (and whether at most 2 are merged?) is likely a closely guarded secret and subject to change; each μop may be getting smaller to fit more into the same number of KiB.]
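The capacity gap can be sketched with back-of-the-envelope arithmetic. This is a rough comparison under assumptions: the 4-byte average x86 instruction length below is a guess for illustration, since x86 instructions actually vary from 1 to 15 bytes.

```python
# Rough L1 Icache capacity comparison; numbers are the ones quoted in
# this thread plus an assumed average x86 instruction length.

# Apple M1 performance core: 192 KiB Icache, fixed 4-byte AArch64 instructions.
m1_insns = 192 * 1024 // 4
print(m1_insns)  # 49152 instructions

# AMD Zen 2+: 32 KiB Icache; x86 instructions are variable-length (1-15 bytes),
# so assume ~4 bytes on average for a ballpark figure.
zen_insns = 32 * 1024 // 4
print(zen_insns)  # 8192 instructions, plus whatever the separate uop cache holds
```

Even adding the “6.75K” μop-cache entries on top of that ballpark, the Zen front end would seem to track far fewer decoded instructions than M1’s Icache can hold.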

Apple’s Arm L1 KiB numbers could actually be split between L1 and a μop cache, maybe evenly, maybe not, for all I know, giving about 20+K instructions? I believe at least some Arm/RISC chips have micro-ops, since RISC chips are less RISC than they used to be…

[Intel had a trace cache in the Pentium 4, and I believe no current chips have one: “Intel later introduced a similar but simpler concept with Sandy Bridge called micro-operation cache (UOP cache).”]

Also, the Dcache is only up to 48 KiB for AMD vs. 128 KiB for Apple, and at least that’s an apples-to-apples (no pun intended) comparison.

EDIT: I see now the AMD L2 cache is massive and per core, and getting larger, up to 1 MB, compensating for the small L1 I- and D-caches, though of course with higher latency. Apple’s L2 isn’t per core, and Apple has no L3, unlike AMD.

I think this reflects that AMD is going for very many cores (focusing more on data centers, with e.g. desktops and games an afterthought, unless programmers are forced to eliminate single-threading further, or already have), and then per-core cache is expensive; you can rather have it smaller and have more cores. Sacrificing single-threaded performance (and per-core performance even for multi-threaded work, until you multiply by the number of cores used at once)?!

So in short, are Apple’s chips still the best around, at least for single-threaded performance? Any good competitors?

Another thing I find interesting is that AMD’s L1 cache latency is “4–8” [cycles]; I would have thought the ideal was a single cycle. You need to issue new instructions every cycle, so this might be explained by pipelining and/or by the μop cache being the important part. And its latency: does anyone know it?

The Dcache for AMD actually also has “4–8” cycle latency. All instructions need data, usually from registers rather than from cache or memory, though often not just from registers. I also thought the Dcache was single-cycle… is that no longer the case? Or can the CPU prefetch from the Dcache for upcoming instructions?
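On why a 4-cycle L1 isn’t fatal: a load’s latency is largely hidden by pipelining and out-of-order execution, and what matters on average is how often you miss. One standard way to see this is the average memory access time (AMAT) formula; a minimal sketch with made-up cycle counts and miss rates (illustrative assumptions, not measured Zen numbers):

```python
# AMAT = L1 hit time + L1 miss rate * (L2 time + L2 miss rate * memory time)
# All numbers here are illustrative assumptions, not vendor data.
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_lat):
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_lat)

# With a 4-cycle L1 and decent hit rates, the average stays close to 4 cycles:
print(amat(l1_hit=4, l1_miss_rate=0.05, l2_hit=14, l2_miss_rate=0.2, mem_lat=300))
```

The per-load latency mostly dominates in dependent-load (pointer-chasing) code, where each load’s address comes from the previous load and nothing can be overlapped.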


I’m certainly out of my depth here, but my understanding is that with the Zen line of architectures, AMD focused on a chiplet design, which allowed them to keep fab costs down by using the same chiplets across CPUs with a wide range of core counts.

So it would make sense to me that the L2 cache is core-local, partially for their focus on server chips, but even more so because it fits with the overall design of wanting Lego-style chiplets that can be stacked together in many configurations.

This stands in stark contrast to Apple Silicon’s system-on-a-chip design philosophy, where there’s basically no configurability. They are targeting just a handful of different CPU configurations, so they can spend time making sure the shared cache works well.

L1 is always core-local. [Now fixed above: you meant to write that L2 is core-local for AMD.]

Historically L2 has always been shared, not core-local. Making L2 core-local rather than shared makes sense for chiplets (many, but not all, of the cores on each), with the L3 somewhere else, global and shared.

Sorry, yep! I misspoke.

Could it be related to the fact that Apple uses 16KiB pages while the PC world uses 4KiB pages?

This explanation sheds some light on why AMD L1 could be only 32 KiB.
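One concrete mechanism behind the page-size connection (a standard constraint, though the associativity figures below are assumptions about these specific designs): with a virtually indexed, physically tagged (VIPT) L1, the index bits must fit within the page offset to avoid aliasing, which caps the L1 size at page size times associativity.

```python
# Max aliasing-free VIPT L1 size = page size * number of ways.
def max_vipt_l1(page_size, ways):
    return page_size * ways

# x86 with 4 KiB pages and an assumed 8-way L1:
print(max_vipt_l1(4096, 8) // 1024)    # 32 KiB, matching Zen's L1 sizes
# Apple with 16 KiB pages and an assumed 8-way L1:
print(max_vipt_l1(16384, 8) // 1024)   # 128 KiB, matching M1's Dcache
# 12 ways at 16 KiB pages would allow 192 KiB, M1's Icache size.
```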


Not only is the page size 4x in Apple M1 (which I didn’t think of as a factor), but the cache line size is also 2x, i.e. 128 bytes (as in PowerPC), vs. x86’s 64 bytes (by now; it used to be 16 bytes). I think the trend is toward longer lines, and we will never go below 64-byte ones again.
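A small consequence of line size worth spelling out (a sketch, using the 64- and 128-byte sizes mentioned above): a longer line moves more data per miss, and it also changes the granularity at which false sharing between cores occurs.

```python
# Elements per cache line for 8-byte values (pointers, doubles, Int64).
def elems_per_line(line_bytes, elem_bytes=8):
    return line_bytes // elem_bytes

print(elems_per_line(64))    # 8 per x86-style 64-byte line
print(elems_per_line(128))   # 16 per Apple/POWER-style 128-byte line

# So two per-thread counters less than 128 bytes apart can false-share
# on an M1 even though they would sit on different 64-byte x86 lines.
```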


You can compare other CPUs in Latency Data
