- Apple M2 Ultra with 24-core CPU, 60-core GPU, 32‑core Neural Engine
- 192GB unified memory
How fast would Julia be on this compared with “equivalent” Intel hw?
Serial and threaded, if you have those numbers…
Edit: As pointed out below, I am hopeful that someone has already done some benchmarking.
But anecdotes are welcome too.
How do you measure how “fast” Julia is? I would say it depends on the workload?
You could maybe run GitHub - IanButterworth/SystemBenchmark.jl: Julia package for benchmarking a system and compare it across machines.
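If it helps, here is a minimal sketch of how running that package and comparing against its reference machine might look, based on the package's README (the exact function names may differ between versions):

```julia
using Pkg
Pkg.add("SystemBenchmark")

using SystemBenchmark

# Run the full suite (CPU, memory, disk, GPU if available);
# returns a DataFrame of results.
res = runbenchmark()

# Compare your results to the package's reference system —
# ratios > 1 mean your machine is slower on that test.
comparetoref(res)
```

Saving the resulting DataFrame (e.g. via CSV.jl) would let you diff an M2 Ultra against an Intel box directly.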
CPU performance-wise it won’t beat an AMD Ryzen 7950X.
Though the memory subsystem has more bandwidth and more RAM.
So in cases where the 7950X is memory-bound, the M2 might not be.
It is usually better to think of Apple’s CPUs in terms of what they uniquely enable.
Mainly, their GPU memory is shared with the CPU, which means they can do things that require a lot of memory (they are very attractive to those who run inference with large models).
I have the M2 Max (the top-tier configuration of the Max).
My experience with the M1 Max:
- impressive CPU bandwidth: 5x compared to an x86 laptop, rivaling Xeon servers
- impressive energy efficiency: competes with x86 servers with no noise and no electrical plug
- impressive memory for the GPU: try to buy a discrete GPU with 192 GB of RAM
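For what it's worth, a crude way to probe that CPU bandwidth claim yourself is to time a streaming pass over a buffer much larger than the last-level cache. A hedged sketch (single-threaded, read-only, so it underestimates what a STREAM-style multi-threaded test would report):

```julia
using BenchmarkTools

# Buffer of 2^27 Float64s ≈ 1 GiB — far larger than any on-chip cache,
# so summing it is limited by memory read bandwidth, not arithmetic.
a = rand(2^27)

t = @belapsed sum($a)              # seconds for one full pass
bw = sizeof(a) / t / 1e9           # bytes moved / time → GB/s
println("single-core read bandwidth ≈ $(round(bw, digits = 1)) GB/s")
```

Running this on both machines gives a like-for-like number, though per-core bandwidth and aggregate bandwidth can differ a lot on Apple Silicon.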
Metal.jl is far less mature than CUDA.jl.
Assuming a particular interest in PDEs and the finite element method, the main issue is probably the
state of GitHub - JuliaLinearAlgebra/AppleAccelerate.jl: Julia interface to the macOS Accelerate framework, which (I could be wrong) does not allow calling Apple Accelerate’s sparse solvers.
Precisely. My hope was that someone had already done that…
It doesn’t seem like it currently does. I wonder how feasible it is, given that the sparse routines in Accelerate are probably(?) not in one-to-one correspondence with SuiteSparse’s.
I hadn’t given it much thought until I saw this thread, but the speedup from using Accelerate over OpenBLAS can hardly be overstated! Maybe that was clear to everyone but me
Simple benchmark: matmul on dense 1000x1000 Float32 matrices yields a whopping speedup of ~4x for me (with 4 BLAS threads). Granted, I tried it on a “meager” M1 (not Pro, Max or Ultra), so it might be less pronounced on the beefier SoCs.
Still, pretty great for a drop-in replacement.
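In case anyone wants to reproduce this, here is roughly how that comparison can be set up. I'm assuming the usual AppleAccelerate.jl behavior, where simply loading the package forwards BLAS calls to Accelerate via libblastrampoline:

```julia
using BenchmarkTools, LinearAlgebra

A = rand(Float32, 1000, 1000)
B = rand(Float32, 1000, 1000)
BLAS.set_num_threads(4)

# Baseline: Julia ships with OpenBLAS as the default backend.
t_openblas = @belapsed $A * $B

# Loading AppleAccelerate swaps the BLAS backend to Accelerate,
# which can use the AMX coprocessor for Float32 matmul.
using AppleAccelerate
t_accelerate = @belapsed $A * $B

println("speedup ≈ ", round(t_openblas / t_accelerate, digits = 1), "x")
```

Note the order matters: time the OpenBLAS case before loading AppleAccelerate, since the swap applies process-wide.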
Or frustrating, because Apple is, from what I learned today, pretty tight-lipped about their “secret ingredient”, which is a FMA accelerator unit in the Silicon chips dubbed Apple Matrix Coprocessor (AMX). Apple does not disclose how to use it, and officially makes it only accessible through the Accelerate framework. It’s been reverse engineered though.
Again, might be common knowledge, but TIL!
In my case, I am using an M1 Max instead of the Ultra. I can say that it easily beats a huge Xeon server when performing complex satellite simulations. In this case, we do not have much room for parallelization, which would favor the Xeon (64 cores).
Given that the M1 Max I am using cost a fraction of the price of the Xeon server, IMHO the M-series computers from Apple are the best platforms for that kind of computation.
Just one tip when benchmarking, enable this package: GitHub - JuliaLinearAlgebra/AppleAccelerate.jl: Julia interface to the macOS Accelerate framework
Shouldn’t the cost comparison be apples-to-apples? (Pun intended… ;-))
The number of cores, memory, caches, … are likely quite different, aren’t they?
Yes, it should be! The Xeon is of course much more capable, but its performance for those simulations is way behind. I think the M1 Max completes one scenario in 40% less time. Hence, the relative cost of the M1 would still be better even if I selected a lower-end Xeon, which would likely perform worse than the current one.
Sorry to jump into this conversation: we are currently looking for a workstation/server to run simulations in a 7-10 researcher environment (Julia, but also a lot of R and a few Python and MATLAB users).
Why does nobody mention the AMD Threadripper CPUs? Looking at benchmarks, they seem to have the best single-thread performance (which remains important for many custom programs) while not having the 128 GB RAM limitation of desktop parts, and still plenty of cores…
I have just acquired a workstation with an AMD Ryzen 9 7950X 16-core processor at 4501 MHz, 128 GB DDR5, and L1/L2/L3 caches of 1, 16, and 64 MB. It is very snappy. My colleague has now purchased an M2 Ultra (it hasn’t arrived yet, so I cannot run comparisons myself).
One of the main differences is the max memory bandwidth. For your CPU, Google says ~73 GB/s, which is decent, but nowhere near the 800 GB/s you get with the M2 Ultra. And sparse matrix multiplication, which is at the heart of solving PDEs, is memory-bandwidth-limited.
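A quick back-of-the-envelope calculation shows why bandwidth dominates here. In CSR format with Float64 values and 32-bit column indices, each stored nonzero streams about 12 bytes (8 for the value, 4 for the index) and performs 2 flops (one multiply, one add), ignoring the vector traffic:

```julia
# Roofline-style ceiling for sparse matrix-vector products:
# flops are bounded by bandwidth / bytes-per-nonzero * flops-per-nonzero.
bytes_per_nnz = 8 + 4      # Float64 value + Int32 column index
flops_per_nnz = 2          # one multiply + one add

for bw_gbs in (73, 800)    # ~Ryzen 7950X vs. ~M2 Ultra peak bandwidth
    gflops = bw_gbs / bytes_per_nnz * flops_per_nnz
    println("$(bw_gbs) GB/s → SpMV ceiling ≈ $(round(gflops, digits = 1)) GFLOP/s")
end
```

By this estimate the M2 Ultra's ceiling for SpMV is roughly 10x higher, regardless of how fast either CPU's cores are.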