I don’t think so. The M4 is very intriguing, but still new to me. It, and probably the M3 as well, is likely not yet tuned for as well as possible by Julia, unless you call special non-Julia floating-point libraries.
Yes, but Apple has something similar to AVX-512, i.e. 512-bit support (does the M3?).*
Yours is a desktop CPU, and no mobile chip can compare, but I do see AMD (and Intel) mobile CPUs claiming less than 500 GFLOPS (for single-precision Float32), and Apple’s M4 claims 4x that.
AMD’s claim for L1 cache is misleading: https://www.amd.com/en/products/processors/desktops/ryzen/9000-series/amd-ryzen-9-9950x.html
Zen 5: 1280 KB

It’s really 80 KB per core:

- 32 KB instructions
- 48 KB data
Yes, 80 KB times 16 cores gives 1280 KB, but in practice it’s almost never all used. Such per-core L1 numbers are very important (as is IPC, at least for integer work), and likely mostly for the performance cores; can the L1 caches of the slower efficiency cores be ignored?
I believe many (all by now?) CPUs bypass the L1 cache for floating-point work, so L2 and L3 are most important there.
Apple claims a 10-wide issue (isn’t that rather large?), which is important for integer work, but in practice it likely means about 3 IPC, as in:
Mobile Zen 5 doesn’t enjoy the same lead, and performs very closely to desktop Zen 4. In its mobile variant, Zen 5 has a weaker AVX-512 implementation, less cache, and higher memory latency. Still, it’s able to stand even with Zen 4 despite those handicaps. Of course desktop Zen 4 will likely take the lead at stock speeds
*
Apple can actually use SME from just one core (i.e. the SME unit is independent of the cores and can be driven by just one), and claims 2000 GFLOPS, which is rather impressive, but only for Float32 (4x the Float64 rate).
From the unofficial docs already posted (this seems like very intriguing hardware):
A limited subset of SVE is supported by the SME block, and it needs to be in the streaming SVE mode to access these instructions. The scalable vector length (VL) on M4 is 512-bit, meaning that each register is 64-byte wide and that the ZA storage is 64x64 or 4096 bytes large. The SME unit can sustain 2000GFLOPS of FP32 multiply-accumulate [but not for long, limited by cache and memory?]
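To put that 2000 GFLOPS figure in perspective (my arithmetic, using the 512-bit VL from the quote): one Z register holds 512/32 = 16 Float32 values, so a single outer-product instruction performs 16 × 16 = 256 MACs, i.e. 512 FLOPs. Sustaining 2000 GFLOPS then requires only about 2000/512 ≈ 3.9 billion outer products per second.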
Apple M4 MACs can work with a wide range of data types, including 8-bit, 16-bit, and 32-bit integers, and 16-bit, 32-bit, and 64-bit floating point, as well as 16-bit brain floating point. Not all data type combinations are supported. In particular, f16 is only supported when accumulating to f32, and i16 can only be accumulated to i64.
As we will see, the SVE/SME abstraction is leaky. […]
The most straightforward way is using the FMLA instruction in streaming SVE mode. This instruction performs vector multiplication with accumulation into a vector destination. However, as shown by the team at Uni Jena, this only reaches a disappointing 31 GFLOPS for the f32 data format, considerably less than what the Neon SIMD of an M4 P-core is capable of. Does this mean that M4 SME is useless for vector operations? Not at all!
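Out of curiosity, here is a minimal sketch of what that FMLA path looks like with ACLE intrinsics (my own illustration, not the Jena team’s benchmark; it assumes a toolchain with ACLE SME support, e.g. a very recent clang, and that n is a multiple of the vector length):

```c
#include <arm_sve.h>

// Streaming-mode FMLA sketch. M4 has no plain SVE, so SVE intrinsics
// are only usable here because __arm_locally_streaming runs the body
// in streaming SVE mode (VL = 512 bits on M4).
__arm_locally_streaming
void fmla_kernel(float *acc, const float *a, const float *b, long n) {
    svbool_t pg = svptrue_b32();              // all 16 f32 lanes active
    for (long i = 0; i < n; i += svcntw()) {  // svcntw() = f32 lanes per vector
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svfloat32_t vc = svld1_f32(pg, acc + i);
        // vc += va * vb element-wise: each FMLA does only 16 MACs
        // (32 FLOPs) into a single vector register, hence the modest
        // throughput compared to the outer-product path below.
        vc = svmla_f32_m(pg, vc, va, vb);
        svst1_f32(pg, acc + i, vc);
    }
}
```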
…
Results
SME features
The following SME features are reported for the Apple M4:
FEAT_SME
FEAT_SME2
SME_F32F32
SME_BI32I32
SME_B16F32
SME_F16F32
SME_I8I32
SME_I16I32
FEAT_SME_F64F64
FEAT_SME_I16I64
Notably missing are 8-bit floating-point support and operations on half-precision (16-bit) floating point other than accumulation to single precision (32-bit). Brain-float 16-bit floating point, by contrast, is fully supported.
I do not know what SME_BI32I32 refers to. Possibly this is a typo in the feature string and it is supposed to be I32I32, i.e. an operation on 32-bit integers?
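Incidentally, on macOS these feature flags can be queried from userland; a small sketch, assuming they are exposed under Apple’s usual hw.optional.arm.FEAT_* sysctl convention (the exact key names are my assumption):

```c
#include <stdio.h>
#include <sys/sysctl.h>

// Print an ARM feature flag on macOS (sketch; key names assumed to
// follow the hw.optional.arm.FEAT_* convention).
static void check(const char *key) {
    int val = 0;
    size_t len = sizeof(val);
    if (sysctlbyname(key, &val, &len, NULL, 0) == 0)
        printf("%s = %d\n", key, val);
    else
        printf("%s not present\n", key);
}

int main(void) {
    check("hw.optional.arm.FEAT_SME");
    check("hw.optional.arm.FEAT_SME2");
    check("hw.optional.arm.FEAT_SME_F64F64");
    return 0;
}
```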
SME matrix multiplication performance
SME matrix multiplication is done with outer products. A single outer product multiplies all elements of two vectors and accumulates them into a ZA tile. …
For optimal use of the SME unit, it’s crucial to understand that outer product instructions are pipelined. This means that to achieve the maximal possible compute rate, we must execute sequences of multiple instructions. A strategy to consider is accumulating to different ZA tiles (this is also pointed out by the Jena team). For instance, when accumulating to fp32, there are four tiles, ZA0-ZA3.
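To make the multi-tile trick concrete, here is a rough sketch of the pattern with ACLE SME intrinsics (again my own illustration, not the post’s benchmark code; the intrinsic and attribute names follow the ACLE SME spec and require a very recent compiler):

```c
#include <arm_sme.h>

// Four independent f32 outer products per iteration, one per ZA tile.
// The same two vectors are reused for all tiles, since this only
// illustrates instruction throughput, not a real matrix multiply.
__arm_locally_streaming __arm_new("za")
void mopa_four_tiles(const float *a, const float *b, long iters) {
    svbool_t pg = svptrue_b32();
    svzero_za();                              // clear the whole ZA storage
    for (long i = 0; i < iters; i++) {
        svfloat32_t va = svld1_f32(pg, a);    // 16 f32: a column of A
        svfloat32_t vb = svld1_f32(pg, b);    // 16 f32: a row of B
        // FMOPA is pipelined, so issuing outer products into four
        // *different* tiles avoids stalling on the previous
        // accumulation (the tile index must be a constant 0-3 for f32).
        svmopa_za32_f32_m(0, pg, pg, va, vb);
        svmopa_za32_f32_m(1, pg, pg, va, vb);
        svmopa_za32_f32_m(2, pg, pg, va, vb);
        svmopa_za32_f32_m(3, pg, pg, va, vb);
    }
}
```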
The table below shows the results of executing the MOPA (outer product and accumulate) instruction for various data types and with different numbers of ZA tiles used for accumulation. The column `type` is the data type (two types are used for widening operations). The column `ZA tiles` is the number of different tiles used for accumulation (“full” means that the entire ZA storage is used). Finally, `GFLOPS` is the measured compute rate; a single MAC counts as two operations (multiplication + addition). In the case of integer data, the more correct term would be GOPS.
| type | ZA tiles | GFLOPS |
| --- | --- | --- |
| f32 | 4 (full) | 2005.3 |
| f32 | 3 | 1503.02 |
| f32 | 2 | 1003.15 |
| f32 | 1 | 500.63 |
| f64 | 8 (full) | 501.73 |
| … | … | … |
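Worth noting (my observation from these numbers): the rate scales almost perfectly with the tile count, 500.63 × 4 ≈ 2002.5 vs. the measured 2005.3 at full ZA, which fits the pipelining explanation above; and f64 at full ZA (501.73 GFLOPS) is almost exactly a quarter of the f32 rate, matching the 4x Float32/Float64 ratio mentioned earlier.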