I would like to know what Julia developers have to share about Julia development on Apple Silicon (M1) machines. What is nice, what is wrong, is it possible to use GLMakie, how is multi-threading performance, and so on…
As soon as native Linux is usable for daily use (should be soon since I wouldn’t need GPU acceleration), I’d like to run some Julia benchmarks. Most of my simulations can easily use many threads but don’t scale well because they saturate memory bandwidth. But at 400 GB/s, the M1 Max has 8x the memory bandwidth of my current laptop, so it should be a big speedup.
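For reference, memory-bandwidth saturation is easy to observe with a STREAM-style triad kernel. Here is a rough sketch (the function names are mine, not from any package) whose measured bandwidth should flatten once the memory bus is saturated, regardless of thread count:

```julia
using Base.Threads

# STREAM-style triad: a[i] = b[i] + s * c[i].
# Moves roughly 3 * 8 bytes per element (two reads, one write of Float64).
function triad!(a, b, c, s)
    @threads for i in eachindex(a, b, c)
        @inbounds a[i] = b[i] + s * c[i]
    end
end

function bandwidth_gbs(n = 10_000_000)
    a, b, c = zeros(n), rand(n), rand(n)
    triad!(a, b, c, 2.0)                 # warm up (compilation)
    t = @elapsed triad!(a, b, c, 2.0)
    return 3 * 8 * n / t / 1e9           # GB/s of read + write traffic
end
```

Launching with `julia -t 1`, `-t 4`, `-t 8`, etc. shows where the scaling curve flattens on a given machine.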
Building Julia from scratch takes 4 minutes total! When it works, it’s very snappy.
- SVD test segfaults on Apple M1 · Issue #41440 · JuliaLang/julia · GitHub
- Darwin/ARM64: Julia freezes on nested `@threads` loops · Issue #41820 · JuliaLang/julia · GitHub
and the other issues labelled with apple silicon.
Yes, I just tried it and it worked correctly. Timing:
```julia
julia> @time begin
           using GLMakie
           function sphere(n)
               u = range(0, stop=2*π, length=n)
               v = range(0, stop=π, length=n)
               x = sin.(u) * sin.(v)'
               y = cos.(u) * sin.(v)'
               z = ones(n) * cos.(v)'
               return (x, y, z)
           end
           (X, Y, Z) = sphere(201)
           R = 1 .- (1 .- mod.(0:0.1:20, 2)) .^ 2 / 15
           RX = R .* X
           RY = R .* Y
           RZ = (0.8 .+ (0 .- (1:-0.01:-1)' .^ 4) * 0.2) .* Z .* R
           scene = Scene(show_axis=false)
           surface!(scene, RX, RY, RZ, color=fill("#ff7518", 1, 1))
           surface!(scene, X/12, Y/12, Z/2 .+ 0.4, color=fill("#080", 1, 1))
           display(scene)
       end
 14.681251 seconds (67.44 M allocations: 3.864 GiB, 5.15% gc time, 69.73% compilation time)
```
Good, when it doesn’t deadlock. Using more than 4 threads may not be beneficial though (at least with the original M1).
This is exactly why I am interested in this machine, which overcomes my natural reluctance toward a closed OS as a Linux user. The (announced) CPU memory bandwidth is 20x faster than what I measure on my laptop (20 GB/s). It may be even more impressive in terms of perf/watt if this kind of performance can be achieved on a cool and QUIET laptop…
As you say, a large fraction of scientific computing kernels are memory bound (e.g. GMG, SpMV,…) and the potential of this architecture for SC looks amazing (HPCG Benchmark, Green500).
IMO, the other big point of interest is the presumably low latency CPU/GPU interoperability allowed by the SOC design. It may open GPGPU on a new class of problems previously eliminated by the CPU-GPU communication overheads. Looks like a super sweet spot for oneAPI/SYCL xPU programming.
If these points are confirmed, I guess that Apple’s silicon strategy will drive the evolution of the competition (perf/watt being the key factor).
Thank you very much for your explanations!
I guess that solving these issues may improve the robustness and quality of Julia’s implementation.
In particular, the threading issues may reveal weak assumptions and help improve the management of heterogeneous cores, which seems to be becoming the new standard (Alder Lake?).
This quote is a bit terrifying
… the 4 high efficiency cores do not cooperate nicely with the 4 high performances cores…
The M1 Pro/Max have 10 perf cores: the scaling should be better.
I just tried STREAMBenchmark.jl on my M1 and got 90 GB/s for some benchmarks on a single thread. Multithreading does not improve performance.
I too prefer (and primarily use) Linux, but I like to have the M1 around for benchmarking.
While it won’t be available in laptop chips, Intel’s upcoming Sapphire Rapids will offer some chips with HBM, and some sources are speculating 1 TB/s or so of memory bandwidth (divided among many more cores, of course).
AMD’s 3d stacking/V-Cache will give many of their chips a very large L3 cache (which itself could have 2TB/s bandwidth per chiplet, but at 32+64 MiB is much smaller than HBM modules), which (depending on the workload) could help a great deal as well.
So still some interesting developments in x86/Linux compatible land coming in the next year.
The M1 pro/MAX have 10 perf cores
8 perf + 2 efficiency.
I may just need to test them more to get threading deadlocks, but I haven’t seen them from LoopVectorization/Polyester. My impression (having not investigated it much) is that base threading and libraries using it are at risk, particularly in code that spawns tasks relatively rapidly.
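For anyone wanting to stress the base scheduler the way such code does, here is a minimal sketch (the function name is mine) that rapidly spawns many short-lived tasks via `Threads.@spawn`, the pattern described as being at risk:

```julia
using Base.Threads

# Spawn many short-lived tasks in quick succession and sum their results.
# This is the kind of rapid task spawning reported as risky on the M1.
function spawn_many(n)
    tasks = [Threads.@spawn i^2 for i in 1:n]
    return sum(fetch, tasks)
end
```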
Hi @Elrod, we should chip in to get you the M1 Max model
You are comparing future server architectures with an available laptop: again, I think that perf/watt (or bandwidth/watt) is the most relevant metric to consider.
Looks like I have been too optimistic about Makie:
- `graphplot` stackoverflow · Issue #34 · JuliaPlots/GraphMakie.jl · GitHub
- Segfaults on 1.7.0-rc1 Apple M1 · Issue #42624 · JuliaLang/julia · GitHub
These appear to be all related to SVD test segfaults on Apple M1 · Issue #41440 · JuliaLang/julia · GitHub. There is a path forward to address the problem, but someone has to do the work.
Is it a problem with Julia’s implementation or an LLVM bug?
Yeah, unfortunately there doesn’t seem to be any competition there at the moment.
#41440 problems are pretty frequent at the moment.
Personally, I assumed that the threading deadlocks aren’t due to heterogeneous cores, but to the weaker memory model of ARM (vs x86) plus the M1’s massive out-of-order execution exposing bugs in the threading implementation.
The M1 doesn’t deadlock when running under Rosetta (emulating x86), for example. When doing so, it uses the x86 memory model.
I see, you’re referring to the addition of barriers like in these threads
Complex stuff… slightly above my head
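Roughly, the barrier discussion is about patterns like the following message-passing sketch (the type and function names are mine, using the `@atomic` field support added in Julia 1.7): without the release/acquire pair, ARM’s weaker memory model may let the `data` write become visible after the `ready` flag, whereas x86 (and hence Rosetta) would not reorder them.

```julia
using Base.Threads

mutable struct Mailbox
    @atomic ready::Bool
    data::Int
end

function send!(m::Mailbox, x)
    m.data = x
    # Release store: guarantees the data write is visible before the flag.
    @atomic :release m.ready = true
end

function receive(m::Mailbox)
    # Acquire load: pairs with the release store above.
    while !(@atomic :acquire m.ready)
        yield()
    end
    return m.data
end
```

With only plain stores, the receiver could in principle observe `ready == true` but still read a stale `data` on a weakly ordered CPU.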
It’s a bug on Julia’s side, wrong code model used, see SVD test segfaults on Apple M1 · Issue #41440 · JuliaLang/julia · GitHub and following messages.
Do thread-checker tools such as Detect Data Races Among Your App’s Threads help catch some of these bugs?
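For instance, I would expect them to flag something like this sketch (the names are mine) of an unsynchronized counter, next to its atomic fix:

```julia
using Base.Threads

# Unsynchronized read-modify-write on a shared Ref: a data race that a
# race detector should flag (updates can be lost under multiple threads).
function racy_sum(n)
    s = Ref(0)
    @threads for i in 1:n
        s[] += 1
    end
    return s[]
end

# Same computation with an atomic counter: race-free.
function atomic_sum(n)
    s = Atomic{Int}(0)
    @threads for i in 1:n
        atomic_add!(s, 1)
    end
    return s[]
end
```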
What about LibTask?
I use Turing.jl a lot in almost all of my research and Julia code. I have an M1 MacBook Air, but I haven’t yet installed the 1.7-rcs because I was getting errors when installing Turing.jl.
But I’m still not convinced that having to build a binary library for that is a good idea
If your assumption is correct, should this weaker memory model also affect Julia execution when running in a parallel Linux VM?
I tried running the second example from the issue Mose linked.
It hung when run natively, but not when run on a Linux (AArch64) VM.
The Linux VM shouldn’t be hitting the segfaults either, so seems like that’ll be the way to go.
You mean that I should use a Linux VM on Apple Silicon until these issues are fixed?
If
- the Julia implementation (the C++ part)
- the target architecture of clang
- clang itself
are all the same, what is the difference? OS thread management?
But ARM’s weaker memory model should be the same on a Linux AArch64 VM (no Rosetta 2 translation in this case), or am I missing something?
It is surprising that such a small M(N)WE like this Darwin/ARM64: Julia freezes on nested `@threads` loops · Issue #41820 · JuliaLang/julia · GitHub brings a bug so difficult to catch. Is clang’s thread sanitizer option (`-fsanitize=thread`) totally useless in this case? Is the generated machine code very different between native execution and inside the ARM VM?
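For context, the MWE in that issue is essentially a nested `@threads` loop; a sketch in the same spirit (not the issue’s exact code, the function name is mine) is:

```julia
using Base.Threads

# Nested @threads loops in the spirit of issue #41820: on an affected
# M1 setup this kind of pattern could hang; elsewhere it simply counts
# n^2 iterations via an atomic counter.
function nested_count(n)
    c = Atomic{Int}(0)
    @threads for i in 1:n
        @threads for j in 1:n
            atomic_add!(c, 1)
        end
    end
    return c[]
end
```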
OK, I realize that all these questions are probably irrelevant and may be boring coming from an outsider like me, and that I should first get an M1 machine to try to catch up on what has been investigated for several months now. Anyway, thank you again @Elrod and @giordano for all your explanations!