Mac mini M4 pro vs AMD Ryzen 9 9950X for Linear Algebra?

I hope it doesn’t sound too silly to ask about a product that has just been announced but is not yet on the market.

My use case is mostly computation that boils down to linear problems (possibly large matrix-matrix multiplications and solving linear systems repeatedly). It’s all in Julia (which is why I’m posting here rather than elsewhere). For reasons I’d rather not elaborate on, it’s all on CPU, so GPUs are not considered here.
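
To make it concrete, here is a rough sketch of the kind of computation I mean (sizes are made up for illustration):

using LinearAlgebra

N = 4_000                 # illustrative size only
A = rand(N, N); B = rand(N, N); C = similar(A)

mul!(C, A, B)             # large matrix-matrix multiplication (BLAS 3)

F = lu(A)                 # factorize once...
b = rand(N)
for _ in 1:100            # ...then solve linear systems repeatedly
    x = F \ b
end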

I already bought a Ryzen 9 9950X, 96GB of RAM (2×48GB), and a motherboard. However, I haven’t opened the packages yet and could return them for a full refund. I saw the release of the new Mac mini, which can be configured with an M4 Pro (10 performance cores + 4 efficiency cores) and at most 64GB of RAM. I imagine the 9950X would probably still be more performant for numerical work, especially with AVX-512 support, which should be useful. However, I am very tempted by the physical size of the Mac mini, which could easily be carried around. It also draws much less power, with a 100W power supply. The 9950X will live in a traditional tower, so definitely not something that can easily come along when I move between locations. On the other hand, the Mac mini at full capacity (except storage) would cost roughly $600 more, and it’s not obvious to me whether the performance gap would be substantial.

Any thoughts on this? Maybe Julia performance numbers for the M3 Pro / M3 Max would be useful for comparison?

2 Likes

Here’s my take: hardware isn’t all that matters, especially for your local computer, because once your computations become more expensive you’ll likely work on clusters or in the cloud anyway. I’d focus on user experience and convenience instead. Ask yourself which OS you want to work with, or whether loud fan noise annoys you or, perhaps, gives you a great feeling because you’re using your machine to its limits. Things like this can have a much bigger influence on your experience and the joy factor (not to be underestimated).

(Personally, I’d take the Mac mini. The Mac computers I have or had were the best computers I’ve ever owned.)

19 Likes

You could buy an mATX or ITX board and build a mini computer around the Ryzen 9950X. It will probably still weigh more than the Mac mini, but maybe it would be light enough for you to carry around?

How important is mobility to you? The original configuration you bought is probably the best a home user can get for the tasks you mentioned.

1 Like

Apple M4 = ARMv9.2-A with SME (the Arm Scalable Matrix Extension).

And the x86-64 counterpart (Intel’s Advanced Matrix Extensions, AMX) is not supported by the AMD 9950X.

I’m not sure how well Julia supports the SME instruction set or the NPU; it’s probably best to use them via MLX. Is there Julia access to the Apple GPU via MLX and/or Metal Performance Shaders (MPS)?
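
For reference, Julia can already reach the Apple GPU via Metal.jl; a minimal sketch, assuming Metal.jl’s current MtlArray API (as far as I know, MLX has no official Julia bindings yet):

using Metal   # Apple-GPU arrays for Julia, macOS only

# Float32 is the GPU-native type; Float64 support on Apple GPUs is limited.
a = MtlArray(rand(Float32, 2048, 2048))
b = MtlArray(rand(Float32, 2048, 2048))

c = a * b           # matmul runs on the Apple GPU
result = Array(c)   # copy the result back to host memory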

The AMD 9950X offers affordable expandability (including ECC memory), and if the code also needs to run in the cloud, this solution will be much simpler. Additionally, AMD is releasing Strix Halo next year (~16 Zen 5 cores, 40 compute units of RDNA 3.5, XDNA2, and a 256-bit LPDDR5X-8000 memory controller; expected performance around a mobile RTX 4070), which I think could be an ideal alternative to the Mac mini M4 Pro.

On the 9950X, under a 512-bit AVX-512 workload on 32 threads, the CPU speed is ~3.82 GHz (Zen 5’s AVX-512 Teardown + More…).
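
If you want to check what Julia/LLVM actually detects on a given chip, something like this works (the "znver5" name is my guess for what a Zen 5 part reports):

julia> versioninfo(verbose = true)   # prints the detected CPU and its features

julia> Sys.CPU_NAME                  # LLVM's name for the host CPU, presumably "znver5" here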

2 Likes

If portability is a major concern for you, you can actually fit a 9950X with only two sticks of RAM into a very small, pleasing ITX case (especially since you’re not too concerned about the GPU, so you won’t need some 3-slot monster).

I would expect the 9950X to still significantly outperform the M4 Pro for the sorts of problems you’re talking about (with the added ability to upgrade to even more RAM if you decide you need it later).

That said, I agree with @carstenbauer: I’d be much more concerned about non-performance considerations here. I think the main question is simply “do you want to use macOS or not”, followed by “would you rather have AppleCare (for a price), or the freedom and flexibility to fix your machine yourself or at a regular repair shop if something bad happens?”

And of course, on the price angle, there’s always the question of “how much does $600 in savings mean to you?” For some people, $600 of savings is a huge deal, whereas there are always those who would gladly pay $600 for the most marginal of improvements.

More info:

M4 CPU: no support for SVE or SVE2. However, you may be able to use the GPU or NPU for vector operations, thanks to unified memory.

Apple M4 Support - expected with LLVM 19

  • -mcpu=apple-m4
  • “Technically apple-m4 is ARMv9.2a, but a quirk of LLVM defines v9.0 as requiring SVE, which is optional according to the Arm ARM and not supported by the core. ARMv8.7a is the next closest choice.”
  • “Extensions atop ARMv8.7a exposed are AES, SHA2, SHA3, FP16, FP16FML, SME, SME2, SMEF64F64, and AEK_SMEI16I64.”
  • “So there you have it straight from Apple… The Apple M4 is an ARMv9.2a based design. However, it lacks SVE (and SVE2) support. There were rumors that the Apple M4 supported Scalable Vector Extensions but now again by this Apple code comment and the associated ISA being exposed by the LLVM compiler, SVE/SVE2 is not present for the Apple M4.”

(Quotes via the LLVM 19 Apple M4 support coverage.)

3 Likes

It’s a bit hard to judge because the M4 in that configuration hasn’t been benchmarked yet, but I’d take a look at cpubenchmark.net to get an idea of what you’d be missing:
https://www.cpubenchmark.net/compare/6211vs5750/AMD-Ryzen-9-9950X-vs-Apple-M3-Pro-12-Core
From all I’ve heard, the M4 won’t be much faster than the M3 (~3-10%?), but I’m not sure which M3 configuration would be closest to what you have in mind for the Mac mini.
The energy efficiency and form factor of the M4 are pretty amazing, so that could be a pretty big pro.
The lack of extensibility and repairability is a big con, I’d say. Also, I’m not sure how much the M4 will throttle on longer-running tasks in the mini.
At least the old M2 Mac mini (non-Pro) didn’t seem to suffer from throttling:
https://www.notebookcheck.net/Apple-Mac-Mini-M2-2023-review-Apple-M2-unleashing-its-power-via-desktop.745320.0.html
It’s not clear whether the much bigger Pro version comes with the same cooling and therefore more throttling.

2 Likes

I doubt it. My last two builds were mini-ITX cases, and they are definitely not anything you would carry around. Also, after the last one, I’ve decided I’m not going to build any more small-form-factor PCs, because I’ve found it to be nothing but a giant pain in the arse. For me, they don’t even really save that much space, so you’re just making the build much more difficult, and in return all you’re really getting is something that looks cooler than a full-size tower. I’ve actually found myself reluctant to upgrade them because they are seriously that much of a pain to build in, which kind of defeats much of the purpose of the DIY route.

3 Likes

I don’t think so. The M4 is very intriguing, though still new to me. Julia is probably not yet tuned as well as possible for it (nor, probably, for the M3), unless you call special non-Julia floating-point libraries.

Yes, but Apple has something similar to AVX-512, i.e. 512-bit support (does the M3?).*

Yours is a desktop CPU, and no mobile chip can compare, but I do see AMD (and Intel) mobile CPUs claiming less than 500 GFLOPS (for single-precision Float32), and Apple’s M4 claims 4x that.

AMD’s claimed L1 cache size is misleading: https://www.amd.com/en/products/processors/desktops/ryzen/9000-series/amd-ryzen-9-9950x.html

Zen 5: 1280 KB L1

It’s actually 80 KB per core:

  • 32 KB instructions
  • 48 KB data

Yes, 80 KB times 16 cores gives 1280 KB, but in practice it is almost never all used. Such numbers (L1, and per core) are very important (as is IPC, at least for integer work), and mostly for the performance cores; the L1 cache of the slower cores can likely be ignored?

I believe many (all by now?) CPUs bypass the L1 cache for floating-point work, in which case L2 and L3 matter most.
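
As a rough way to see where the cache levels matter on a particular chip, one can sweep the working-set size of a simple reduction and watch the effective bandwidth fall at each cache boundary. A sketch (the sizes are arbitrary):

# Sweep working-set sizes from L1-resident to RAM-resident and
# report the effective read bandwidth of a simple reduction.
for n in (2^12, 2^15, 2^18, 2^21, 2^24)   # 32 KB ... 128 MB of Float64
    x = rand(n)
    reps = max(1, 2^27 ÷ n)               # touch ~1 GiB in total per size
    sum(x)                                # warm up
    t = @elapsed for _ in 1:reps; sum(x); end
    println(rpad("$(8n ÷ 1024) KB", 12), round(reps * 8n / t / 1e9, digits = 1), " GB/s")
end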

Apple claims a 10-wide issue (isn’t that rather large?), which is important for integer work, but in practice it likely means about 3 IPC, as in:

Mobile Zen 5 doesn’t enjoy the same lead, and performs very closely to desktop Zen 4. In its mobile variant, Zen 5 has a weaker AVX-512 implementation, less cache, and higher memory latency. Still, it’s able to stand even with Zen 4 despite those handicaps. Of course desktop Zen 4 will likely take the lead at stock speeds

* Apple can actually drive SME from just one core (i.e., the SME unit is independent of the cores and can be controlled by a single one), and claims 2000 GFLOPS, which is rather impressive, but only for Float32 (4x faster than for Float64).

From the unofficial docs already posted (this seems like very intriguing hardware):

A limited subset of SVE is supported by the SME block, and it needs to be in the streaming SVE mode to access these instructions. The scalable vector length (VL) on M4 is 512-bit, meaning that each register is 64-byte wide and that the ZA storage is 64x64 or 4096 bytes large. The SME unit can sustain 2000GFLOPS of FP32 multiply-accumulate [but not for long, limited by cache and memory?]

Apple M4 MACs can work with a wide range of data types, including 8-bit, 16-bit, and 32-bit integers, and 16-bit, 32-bit, and 64-bit floating point, as well as 16-bit brain floating point. Not all data type combinations are supported. In particular, f16 is only supported when accumulating to f32, and i16 can only be accumulated to i64.

As we will see, the SVE/SME abstraction is leaky. […]
The most straightforward way is using the FMLA instruction in streaming SVE mode. This instruction performs vector multiplication with accumulation into a vector destination. However, as shown by the team at Uni Jena, this only reaches a disappointing 31 GFLOPS for the f32 data format, considerably less than what the Neon SIMD of an M4 P-core is capable of. Does this mean that M4 SME is useless for vector operations? Not at all!

Results

SME features

The following SME features are reported for the Apple M4:

  • FEAT_SME
  • FEAT_SME2
  • SME_F32F32
  • SME_BI32I32
  • SME_B16F32
  • SME_F16F32
  • SME_I8I32
  • SME_I16I32
  • FEAT_SME_F64F64
  • FEAT_SME_I16I64

Notably missing is 8-bit floating point support and operations on half-precision (16-bit) floating point except accumulate to single-precision (32-bit). Brain-float 16-bit floating point is instead supported fully.

I do not know what SME_BI32I32 refers to. Possibly this is a typo in the feature string and it is supposed to be I32I32 i.e. operation on 32-bit integers?

SME matrix multiplication performance

SME matrix multiplication is done with outer products. A single outer product multiplies all elements of two vectors and accumulates them into a ZA tile. …
For optimal use of the SME unit, it’s crucial to understand that outer-product instructions are pipelined. This means that to achieve the maximal possible compute rate, we must execute sequences of multiple instructions. A strategy to consider is accumulating into different ZA tiles (this is also pointed out by the Jena team). For instance, when accumulating to fp32, there are four tiles, ZA0-ZA3.

The table below shows the results of executing the MOPA (outer product and accumulate) instruction for various data types and with different numbers of ZA tiles used for accumulation. The column “type” is the data type (two types are shown for widening operations). The column “ZA tiles” is the number of different tiles used for accumulation (“full” means that the entire ZA storage is used). Finally, GFLOPS is the measured compute rate; a single MAC counts as two operations (multiplication + addition). For integer data, the more correct term would be GOPS.

| type | ZA tiles | GFLOPS  |
|------|----------|---------|
| f32  | 4 (full) | 2005.3  |
| f32  | 3        | 1503.02 |
| f32  | 2        | 1003.15 |
| f32  | 1        | 500.63  |
| f64  | 8 (full) | 501.73  |
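
For Julia users, the practical route to this unit is presumably Apple’s Accelerate BLAS, which AppleAccelerate.jl can swap in through libblastrampoline. A sketch (whether Accelerate actually dispatches to SME on the M4 is my assumption, not something I have verified):

using LinearAlgebra
using AppleAccelerate   # replaces OpenBLAS with Apple's Accelerate BLAS

N = 4096
A = rand(Float32, N, N); B = rand(Float32, N, N); C = similar(A)

mul!(C, A, B)           # warm up / compile
t = @elapsed mul!(C, A, B)
println(round(2e-9N^3 / t, digits = 1), " GFLOPS (Float32)")
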
2 Likes

Another consideration is how long you think your compute tasks will take. I haven’t used a Mac in a long time and would be interested to be corrected if this is wrong, but my impression is that the Mac mini machines have minimal or even no active cooling. I don’t know if the M{1,2,3,4} chips ever get hot enough to thermally throttle, but I would think that if you are using every core of the machine and trying to go as fast as possible, you will eventually hit some thermal issues.

If you build a custom AMD machine, on the other hand, even in a reasonably small form factor I bet you could fit much more effective cooling, so maybe for things that take, say, 10+ minutes the AMD machine would really pull ahead.

But with that said, I could certainly be wrong! I mostly write to float the thought that peak CPU performance only matters if you can actually pull heat away from it fast enough for it to continue reaching that peak performance.
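
If anyone wants to test the throttling hypothesis, a sustained-load sketch like the following (left running for several minutes) should make any thermal fade visible; the size and iteration count are arbitrary:

using LinearAlgebra

N = 4096
A = rand(N, N); B = rand(N, N); C = similar(A)
mul!(C, A, B)   # warm up

# Hammer the CPU and print GFLOPS per iteration; a steady decline
# over minutes would indicate thermal throttling.
for i in 1:200
    t = @elapsed mul!(C, A, B)
    println("iter $i: ", round(2e-9N^3 / t, digits = 1), " GFLOPS")
end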

EDIT: ah, I see @sdanisch also mentioned this, sorry.

2 Likes

The appropriate numbers for estimating performance on this chip are L1 per core, L2 per core, and L3 total, because L1 and L2 are exclusive to each core while L3 is shared. In older days L2 was shared on some chips, so the appropriate number for those chips was L2 total.

I wouldn’t call it misleading in the sense of accusing AMD – this way of breaking down the numbers is unfortunately pretty much standard. And it absolutely cannot mislead anyone: only people who know about L1 can be misled by a number for L1 size, and such people know that a number like 1MB for L1 must obviously be divided by the number of cores. (I do, however, reserve a special place in hell for people who cite a total in cases where the P and E cores have different L1 sizes.)

PS edit: Palli and I are in almost complete agreement on this; our only difference was “give me the actual number, you marketing nitwits :angry:” vs. “oh, silly number is silly :person_shrugging:”.

This is new to me. Can you link a source for that?

Or did you mean that L1 is pretty irrelevant for bandwidth-bound workloads? Or that some libraries use nontemporal loads/stores in order to not trash L1d with unneeded garbage?

It isn’t (or at least wasn’t) universal:

I am not aware of any x86 implementation that supported L1 cache bypassing for cacheable accesses unless one includes NT writes (which go to a special buffer and bypass all caches).

However, the MIPS R8000 (or specifically the R8010 FPU) had all FP memory accesses bypass L1, communicating directly with the L2 cache. For Itanium 2 (and follow-ons): “Floating-point loads are not cached in the L1D and are instead processed directly by the L2.”

E.g., the Itanium is dead, yes, but since the technique was valuable, I thought it might have survived in later chips (and yes, the instructions themselves would always use the L1I cache).

This is an implementation detail, so you might never really know what is happening. With L1D being very valuable, I still see this as a good way to operate.

This would apply to the accelerator in the M4, I believe (and maybe, or maybe not, to FP loads/stores in each core too), since it’s a separate unit, not per se owned by any core.

SME in Apple M4

Apple’s matrix accelerator is a dedicated hardware unit — it is not part of the CPU core. There is one AMX/SME block in a CPU cluster, shared by all CPU cores.

1 Like

I saw a talk on performance from a C++ conference where the presenter was misled. I had to stop watching at that point.

Personally, I’d take the 9950X (and in fact do own one).

Note that AFAIK, most AVX-512 matmul implementations do just trash the L1 / get no reuse from it.
L2 → L1 bandwidth on these computers is high enough that this is more or less okay, assuming you’re able to hide the latency (and make full use of each cacheline in a big microkernel).

I do not know of any way to actually avoid writing to it; even non-temporal loads go there (but those do avoid writing to L2 or L3).

4 Likes

Seems like the first benchmarks are coming in:
https://browser.geekbench.com/v6/cpu/compare/8598019?baseline=8588693
https://www.cpubenchmark.net/compare/6040vs6211/Apple-M4-10-Core-vs-AMD-Ryzen-9-9950X

It looks like some workloads are much faster on the AMD, but on average it’s quite similar, which is kind of amazing considering the form factor and energy usage.
I love AMD and self-built tower PCs where I have control over everything, but the M4 Pro is certainly pretty attractive.

1 Like

Fully agree. Furthermore, if you don’t always need all the processor power you can get, they can stay in service for 8-10 years. That said, I have actually started to think about replacing my current Mac mini M1 with an M4 model, even though there is little rational reason for that.

On the other hand, your AMD build would probably be more performant for your type of usage, and it is expandable. The Mac mini, while easy to carry around by itself, needs a monitor, mouse, and keyboard, so what is the real use of its portability?

What about using your AMD machine as a server and connecting to it remotely from a notebook when you are on the go?
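
In Julia this workflow is quite natural with Distributed’s SSH workers. A sketch ("amdbox" and the remote julia path are placeholders for your own setup):

using Distributed

# Start 4 worker processes on the remote AMD machine over SSH.
# The executable path is an assumption; adjust to the remote install.
addprocs([("amdbox", 4)]; exename = "/usr/local/bin/julia")

@everywhere using LinearAlgebra

# Run the heavy work remotely; the laptop stays silent.
f = @spawnat 2 begin
    A = rand(4_000, 4_000)
    A * A
end
result = fetch(f)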

3 Likes

Thank you so much to everyone who replied here! It looks like this comparison is indeed interesting to many.

Two important points (for me) in favor of Apple machines:

  • the huge RAM bandwidth (546 GB/s for the M4 Max) makes a large difference for BLAS 1 and BLAS 2 linear algebra operations (e.g. dot products; see the sketch after this list).
  • the complete silence during coding is (for me) the strongest selling point. I have tried to set up a quiet yet powerful x86 desktop; my colleagues consider it quiet, but it is not silent. My MacBook Pro is, and it would be very painful (for me) to go back. At this point I would wait for the first reviews of the new Apple devices to confirm that the complete silence is still there (I have heard that some Mac Studio machines can be noisy).
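
For the bandwidth point above, a quick dot-product benchmark makes it concrete: BLAS 1 operations move O(N) bytes for O(N) flops, so memory bandwidth, not GFLOPS, is the ceiling. A sketch (the size is arbitrary, just large enough to defeat every cache):

using LinearAlgebra

n = 10^8                  # ~800 MB per vector, far beyond any cache
x = rand(n); y = rand(n)
dot(x, y)                 # warm up

t = @elapsed dot(x, y)
println("effective bandwidth: ", round(2 * 8n / t / 1e9, digits = 1), " GB/s")
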
1 Like

My MacBook Pro is very noisy.

M series?

1 Like

Here is an update for anyone who comes across this thread. I ended up with the 9950X and built an Ubuntu PC around it. I am happy with it so far as a “personal server” sitting behind SSH and wired to a switch. With a decent CPU cooler it stays silent when idle and only makes some noise when running something like y-cruncher that pushes the CPU to maximum load.

Some quick benchmarks, following the post here:

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 9950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, generic)
Threads: 16 default, 0 interactive, 8 GC (on 32 virtual cores)
Environment:
  JULIA_NUM_THREADS = 16

julia> using LinearAlgebra

julia> N=449*10*2;

julia> A = rand(N,N); B = rand(N,N); C = similar(A);

julia> 2e-9N^3 / @elapsed mul!(C,A,B)   # first run, includes compilation
1268.744956058828

julia> 2e-9N^3 / @elapsed mul!(C,A,B)
1736.538226330524

julia> 2e-9N^3 / @elapsed mul!(C,A,B)
1729.8860585152077

Ignoring the very first run, which is affected by compile time, I got ~1.7 TFLOPS, much better than the ~0.35 TFLOPS I get on my current laptop.
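
If anyone wants a number that is robust against compile-time and one-off noise, BenchmarkTools can be used with the same formula (reusing N, A, B, C from the session above):

julia> using BenchmarkTools

julia> t = @belapsed mul!($C, $A, $B);   # minimum time over many samples

julia> 2e-9N^3 / t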

Additional info for the PC build:

  • Default CPU settings, nothing tweaked in the BIOS
  • Corsair DDR5 RAM (96GB) running at 6000MHz via the XMP profile, with a 1:1 memory-controller ratio
  • Ubuntu 24.10
5 Likes