It doesn’t get said enough. Apple is a RAM vendor. 50% margin on the base config. 95% margin on memory upgrades.
Not really true:
It is important to know that HBM is about five times pricier than DDR5. It commands a significantly higher cost due to its superior performance and capacity advantages over standard DRAM. The complexity of constructing HBM memory devices and stacks is also notably higher compared to traditional DDR ICs and modules.
Highest performance comes at a price.
Apple isn’t shipping HBM. They’re shipping LPDDR (although in a non-JEDEC spec).
I was that clever: I bought a Mac mini with 256 GB of internal storage plus an external Thunderbolt SSD with 1 TB. While I find this configuration usable for me, I can’t unconditionally recommend it.
First, having data not where the OS expects it to be tends to cause trouble with access privileges. When setting up my Mac the first time, I intended to keep my user folder on the external disk, which is possible in theory. That worked until an OS update… After prolonged fights, I gave up on it. After all, the main reason I prefer Mac is that I just want to use my computer to get things done, not to mess around with the OS. I then also gave up on keeping my Music and Photos libraries, as well as my Julia projects, on that disk. As a result, even though I don’t store that much by modern standards, my system disk is pretty full now.
Second, I had occasional crashes that forced the computer to restart. While not frequent, that had almost never happened to me before on Macs. I tentatively tracked the crashes to communication glitches with the external disk when running an application from it. After I moved all applications onto the internal disk, the crashes never recurred.
Thanks for sharing.
I haven’t tried this setup myself, since I bought a rather expensive configuration (MBP M1 Max, 64 GB / 2 TB) a few years ago in a professional context and have felt no need to upgrade since.
So the conclusion remains the same: Apple makes very large margin on SSD and RAM upgrades.
The competition is getting better though, and I will be very happy to return to Linux when one of the competitors (AMD, Qualcomm, Intel, NVIDIA…) manages to offer similar performance per watt.
Thanks, I needed a good laugh.
This benchmark is good, but it is not clear whether it exploits all 16 threads or not. Perhaps it would be better to set JULIA_NUM_THREADS=1 and repeat the benchmark; that way we know (or at least have an idea) about single-core performance. Would you be kind enough to update your results with single-core tests? Then I can give you some pseudocode you can run to also test sparse-matrix performance, which is widely used in numerical computing.
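As a hedged sketch of what such a sparse test could look like (matrix size and density here are illustrative choices, not a reference benchmark), something along these lines would measure sparse matrix-vector throughput:

```julia
using SparseArrays, LinearAlgebra

# Illustrative sizes: a 10_000 x 10_000 matrix with ~1% nonzeros
N = 10_000
A = sprand(N, N, 0.01)
x = rand(N)
y = similar(x)

mul!(y, A, x)                  # warm-up run (includes compilation)
t = @elapsed mul!(y, A, x)     # timed run
gflops = 2e-9 * nnz(A) / t     # ~2 flops per stored nonzero
```

Sparse mat-vec is essentially pure memory traffic with irregular access, so it should stress bandwidth even harder than the dense GEMV case.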
Hi Norman!
Could you also benchmark matrix-vector multiply GFLOPS in Julia? I would expect your system to score around 20 GFLOPS based on your memory speed.
Here are the results for both Matrix-Vector multiplication and Matrix-Matrix multiplication with Julia v1.11.5:
julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × AMD Ryzen 9 9950X 16-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, generic)
Threads: 16 default, 0 interactive, 8 GC (on 32 virtual cores)
Environment:
JULIA_NUM_THREADS = 16
julia> using LinearAlgebra
julia> N = 449*10*2;
julia> A = rand(N, N); B = rand(N); C = similar(B);
julia> 2e-9N^2 / @elapsed mul!(C,A,B)
# First Run 1.5482084452662266
julia> 2e-9N^2 / @elapsed mul!(C,A,B)
14.976101777558263
julia> 2e-9N^2 / @elapsed mul!(C,A,B)
14.737536001836334
julia> 2e-9N^2 / @elapsed mul!(C,A,B)
15.0477971386375
julia> 2e-9N^2 / @elapsed mul!(C,A,B)
14.928891417753537
julia> A = rand(N, N); B = rand(N, N); C = similar(B);
julia> 2e-9N^3 / @elapsed mul!(C,A,B)
# First Run 1256.2057391195665
julia> 2e-9N^3 / @elapsed mul!(C,A,B)
1736.3682467430988
julia> 2e-9N^3 / @elapsed mul!(C,A,B)
1741.1184796553234
julia> 2e-9N^3 / @elapsed mul!(C,A,B)
1744.847565528222
I tried the nightly version of Julia as well. The results are similar.
Hi! From what I understand, JULIA_NUM_THREADS wouldn’t make a difference here, as OpenBLAS manages its own threads. I tried starting Julia with --threads 1 and it doesn’t seem to change the results. It seems that OpenBLAS will use 16 threads anyway:
julia> BLAS.get_num_threads()
16
You can use BLAS.set_num_threads(1).
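For example, a single-core run without restarting Julia might look like this (a sketch; BLAS.set_num_threads controls OpenBLAS independently of JULIA_NUM_THREADS, and the matrix size here is arbitrary):

```julia
using LinearAlgebra

BLAS.set_num_threads(1)            # pin OpenBLAS to a single thread
N = 2_000
A = rand(N, N); B = rand(N); C = similar(B)
mul!(C, A, B)                      # warm-up (compilation + first touch)
gflops = 2e-9N^2 / @elapsed mul!(C, A, B)
BLAS.set_num_threads(Sys.CPU_THREADS)  # restore the default afterwards
```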
Thanks Norman!
The 15 GFLOPS result is consistent with the 60 GB/s bandwidth observed in the STREAM Triad benchmark conducted by Phoronix. It’s actually slower than the six-year-old Cascade Lake 10980XE, which achieved 19 GFLOPS in Julia as reported by @Elrod.
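The arithmetic behind that consistency: a Float64 GEMV streams 8 bytes per matrix element and does 2 flops on it, so a memory-bound GEMV tops out at roughly bandwidth/4 (assuming the matrix traffic dominates and the vectors stay in cache):

```julia
bytes_per_element = 8     # Float64
flops_per_element = 2     # one multiply + one add per matrix entry
bandwidth_gbs     = 60.0  # STREAM Triad figure from the Phoronix run
expected_gflops   = bandwidth_gbs * flops_per_element / bytes_per_element
# -> 15.0, matching the measured ~15 GFLOPS
```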
I’ll be publishing a video in the coming days where I did some Linear Algebra performance testing on the M4 Max and M3 Ultra in Julia. In this particular test, the M4 Max is 5 times faster and achieves comparable STREAM bandwidth on 12 P-cores to a 96 core 12 channel DDR5 Zen 5 Epyc.
Note that the Intel 10980XE has 4 channels of DDR4, while your AMD 9950X has 2 channels of DDR5.
The DDR5 probably has less than twice the bandwidth of the DDR4 (depending on how you clock it; IIRC my DDR4 is at 3200), so having 2x the channels means the six-year-old chip should still come out ahead in most cases.
12 channels on the Zen5 Epyc is much better.
It probably suffers from needing a large number of cores to realize that bandwidth?
Have you tried setting different thread counts when comparing to the M3 Ultra and M4 Max?
My source for the Epyc was a benchmark run from Phoronix: AMD EPYC Turin 8c Vs. 12c Memory Channel DDR5 Comparison Benchmarks - OpenBenchmarking.org
However, I’ve seen that the latest report by Fujitsu gives a much higher Triad BW for the same processor with the same DDR5-6000 RAM: https://sp.ts.fujitsu.com/dmsp/Publications/public/wp-performance-report-primergy-rx2450-m2-ww-en.pdf
It’s as if they applied a 1.33 multiplier for the TRIAD to account for potential write-allocate?
I was wondering the same about my M4 Max and M3 Ultra results: they achieved 25% higher BW in the Update kernel compared to the Triad kernel, but about the same in the copy kernel. This is something I intend to investigate further in the future.
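For what it’s worth, the 1.33 factor is exactly what write-allocate accounting would predict for the Triad kernel (a[i] = b[i] + s*c[i]):

```julia
# Triad logically moves 3 arrays: read b, read c, write a.
# With write-allocate, the cache line for a is read before it is
# written, so the hardware actually moves 4 streams.
logical_streams = 3
actual_streams  = 4
multiplier = actual_streams / logical_streams   # ≈ 1.333
```

Crediting the benchmark with the extra hidden read stream is a known way some vendor reports inflate the Triad number relative to the classic STREAM counting rules.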
If you need an alternative to the Mac mini M4 Pro, please consider this AMD CPU: https://www.amd.com/en/products/processors/laptop/ryzen/ai-300-series/amd-ryzen-ai-max-plus-395.html
It has much higher RAM bandwidth than the Ryzen 9950X. For example, this PC: Framework | Order a Framework Desktop with AMD Ryzen™ AI Max 300
It comes with 32 GB to 128 GB of unified memory, enabling up to 256 GB/s of memory bandwidth, compared to the Ryzen 9 9950X’s theoretical 96 GB/s.
But that one is much more expensive.
The base model is a lot more expensive than the base Mac mini, but once you’re looking at 64 GB of RAM (which you’ll want if you want to make use of all that memory bandwidth), the prices equalize, and the Framework machine can go up to 128 GB of RAM and its SSD options are way cheaper. (It also comes with a much beefier GPU and more CPU cores.)