Apple M1, M1 Pro, M1 Max and Julia developers

LaurentPlagne · October 29, 2021, 3:30pm

I would like to know what Julia developers have to share about Julia development on Apple silicon (M1) machines: what works well, what does not, whether GLMakie is usable, how multi-threading performs, and so on…

robsmith11 · October 29, 2021, 10:23pm

As soon as native Linux is usable for daily use (should be soon since I wouldn’t need GPU acceleration), I’d like to run some Julia benchmarks. Most of my simulations can easily use many threads but don’t scale well because they saturate memory bandwidth. But at 400 GB/s, the M1 Max has 8x the memory bandwidth of my current laptop, so it should be a big speedup.

giordano · October 29, 2021, 11:21pm

Building Julia from scratch takes 4 minutes total! When it works, it’s very snappy.

and the other issues labelled with apple silicon.

Yes, I just tried

and worked correctly. Timing:

julia> @time begin
           using GLMakie

           function sphere(n)
               u = range(0, stop=2*π, length=n)
               v = range(0, stop=π, length=n)
               x = sin.(u) * sin.(v)'
               y = cos.(u) * sin.(v)'
               z = ones(n) * cos.(v)'
               return (x,y,z)
           end

           (X,Y,Z) = sphere(201)
           R = 1 .- (1 .- mod.(0:0.1:20,2)) .^ 2/15
           RX = R .* X
           RY = R .* Y
           RZ = (0.8 .+ (0 .- (1:-0.01:-1)' .^ 4) * 0.2) .* Z .* R

           scene = Scene(show_axis=false)
           surface!(scene, RX, RY, RZ, color=fill("#ff7518",1,1))
           surface!(scene, X/12, Y/12, Z/2 .+ 0.4, color=fill("#080",1,1))
           display(scene)
       end
 14.681251 seconds (67.44 M allocations: 3.864 GiB, 5.15% gc time, 69.73% compilation time)

Good, when it doesn’t deadlock. Using more than 4 threads may not be beneficial though (at least with the original M1)
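To check whether extra threads actually help on a given machine, one option is to start Julia with a fixed thread count (`julia --threads 4`, or the `JULIA_NUM_THREADS` environment variable) and time a simple threaded reduction at each setting. The following is only a minimal sketch; `threaded_sum` is a hypothetical helper written for illustration, not code from this thread:

```julia
using Base.Threads

# Sum `xs` by splitting it into roughly one chunk per available thread.
# `threaded_sum` is an illustrative helper, not part of any package.
function threaded_sum(xs::AbstractVector{<:Real})
    isempty(xs) && return zero(eltype(xs))
    nchunks = min(nthreads(), length(xs))
    chunk_len = cld(length(xs), nchunks)
    # One task per chunk; each task sums its own view of the data.
    tasks = map(Iterators.partition(eachindex(xs), chunk_len)) do idxs
        Threads.@spawn sum(@view xs[idxs])
    end
    return sum(fetch, tasks)
end

xs = rand(10^7)
threaded_sum(xs)  # warm-up call so compilation is not timed
println("threads = ", nthreads(), ", time = ", @elapsed(threaded_sum(xs)), " s")
```

Running the same script with `--threads 1`, `4` and `8` gives a rough idea of where scaling stops on a particular chip.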

LaurentPlagne · October 30, 2021, 8:03am

This is exactly why I am interested in this machine, which overcomes my natural reluctance, as a Linux user, toward a closed OS. The (announced) CPU memory bandwidth is 20x higher than what I measure on my laptop (20 GB/s). It may be even more impressive in terms of perf/watt if this kind of performance can be achieved on a cool and QUIET laptop…

As you say, a large fraction of scientific computing kernels are memory bound (e.g. GMG, SpMV,…) and the potential of this architecture for SC looks amazing (HPCG Benchmark, Green500).

IMO, the other big point of interest is the presumably low-latency CPU/GPU interoperability allowed by the SoC design. It may open GPGPU to a new class of problems previously ruled out by CPU-GPU communication overheads. Looks like a super sweet spot for oneAPI/SYCL xPU programming.

If these points are confirmed, I guess that Apple’s silicon strategy will drive the evolution of the competition (perf/watt being the key factor).

LaurentPlagne · October 30, 2021, 8:20am

Thank you very much for your explanations!
I guess that solving these issues may improve the robustness and quality of Julia’s implementation.
In particular, the threading issues may reveal weak assumptions and help improve the management of heterogeneous cores, which seems to be becoming the new standard (Alder Lake?).

This quote is a bit terrifying

… the 4 high efficiency cores do not cooperate nicely with the 4 high performances cores…
The M1 pro/MAX have 10 perf cores : the scaling should be better

Elrod · October 30, 2021, 8:30am

I just tried STREAMBenchmark.jl on my M1 and got 90 GB/s for some benchmarks on a single thread. Multithreading does not improve performance.
I too prefer (and primarily use) Linux, but I like to have the M1 around for benchmarking.
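For readers who want a rough single-thread figure of this kind without installing anything, a STREAM-style "copy" kernel can be sketched in a few lines. This is an illustration only: `copy_bandwidth` is a hypothetical helper, it is not the STREAMBenchmark.jl API, and its numbers will not match that package's results exactly.

```julia
# STREAM-style "copy" kernel: read one array, write another, and report
# the best observed bandwidth over a few trials.
# `copy_bandwidth` is an illustrative helper, not from STREAMBenchmark.jl.
function copy_bandwidth(n::Integer; trials::Integer = 5)
    a = rand(Float64, n)
    b = zeros(Float64, n)
    best = Inf
    for _ in 1:trials
        t = @elapsed copyto!(b, a)
        best = min(best, t)
    end
    bytes_moved = 2 * n * sizeof(Float64)  # one read + one write per element
    return bytes_moved / best / 1e9        # GB/s
end

# Arrays of 2^24 Float64 values (128 MiB each) are meant to exceed the caches.
println("copy bandwidth ≈ ", round(copy_bandwidth(2^24), digits = 1), " GB/s")
```

Taking the best of several trials reduces noise from compilation and from other processes competing for memory.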

While it won’t be available in laptop chips, Intel’s upcoming Sapphire Rapids will offer some chips with HBM, and some sources are speculating about 1 TB/s or so of memory bandwidth (divided among many more cores, of course).
AMD’s 3D stacking/V-Cache will give many of their chips a very large L3 cache (which itself could have 2 TB/s bandwidth per chiplet, but at 32+64 MiB is much smaller than HBM modules), which (depending on the workload) could help a great deal as well.
So still some interesting developments in x86/Linux compatible land coming in the next year.

The M1 pro/MAX have 10 perf cores

8 perf + 2 efficiency.

I may just need to test them more to get threading deadlocks, but I haven’t seen them from LoopVectorization/Polyester. My impression (having not investigated it much) is that base threading and libraries using it are at risk, particularly in code that spawns tasks relatively rapidly.

LaurentPlagne · October 30, 2021, 8:39am

Hi @Elrod, we should chip in to get you the M1 Max model.
You are comparing future server architectures with an available laptop: again, I think that perf/watt (or bandwidth/watt) is the most relevant metric to consider.

giordano · October 30, 2021, 8:39am

Looks like I have been too optimistic about Makie:

These appear to be all related to https://github.com/JuliaLang/julia/issues/41440. There is a path forward to address the problem, but someone has to do the work.

LaurentPlagne · October 30, 2021, 8:42am

Is it a problem with Julia’s implementation or an LLVM bug?

Elrod · October 30, 2021, 8:42am

Yeah, unfortunately there doesn’t seem to be any competition there at the moment.

#41440 problems are pretty frequent at the moment.

Personally, I assumed that the threading deadlocks aren’t due to heterogeneous cores, but to the weaker memory model of ARM (vs x86) combined with the M1’s massive out-of-order execution exposing bugs in the threading implementation.
The M1 doesn’t deadlock when running under Rosetta (emulating x86), for example. When doing so, it uses the x86 memory model.

LaurentPlagne · October 30, 2021, 9:02am

I see, you’re referring to the addition of barriers, as in these threads

Complex stuff… slightly above my head

giordano · October 30, 2021, 9:03am

It’s a bug on Julia’s side, wrong code model used, see https://github.com/JuliaLang/julia/issues/41440#issuecomment-932048448 and following messages.

LaurentPlagne · October 30, 2021, 10:00am

Do thread-checker tools such as “Detect Data Races Among Your App’s Threads” help to catch some of these bugs?

Storopoli · October 30, 2021, 10:18am

What about LibTask?

I use Turing.jl a lot in almost all of my research and Julia code. I have an M1 MacBook Air but I haven’t yet installed the 1.7-rcs because I was getting errors when installing Turing.jl in the 1.7-betas.

giordano · October 30, 2021, 10:40am

It’s stalling

But I’m still not convinced that having to build a binary library for that is a good idea

github.com/TuringLang/Libtask.jl

Why not ccalling directly into libjulia instead of having `Libtask_jll`?

opened 03:04PM - 12 Jun 21 UTC

closed 11:17AM - 02 Dec 21 UTC

giordano

Maintaining a library like `Libtask_jll` which links to libjulia isn't super simple, as you need to have libjulia_jll available in the first place and then build against all different variations. Why don't you `ccall` directly into libjulia from Julia, instead of delegating the same work to an external library? This would have the advantage of having `Libtask.jl` more readily available on all platforms and versions of Julia without depending on external libraries.
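As a small illustration of what calling directly into libjulia looks like from Julia code, the sketch below queries two runtime functions with `ccall`. It is not what Libtask.jl does or would need to do; the wrapper names `runtime_major` and `task_via_ccall` are hypothetical, and the two C symbols are assumed to be exported by the running Julia runtime.

```julia
# Call exported runtime functions directly, with no external binary library.
# Both wrapper names are illustrative only.

# Major version number reported by the runtime itself.
runtime_major() = Int(ccall(:jl_ver_major, Cint, ()))

# Ask the runtime for the task that is currently executing.
task_via_ccall() = ccall(:jl_get_current_task, Ref{Task}, ())

println("runtime major version: ", runtime_major())
println("same task as current_task(): ", task_via_ccall() === current_task())
```

Because the symbols are resolved in the already-loaded runtime, nothing has to be built per platform or per Julia version, which is the advantage the issue describes. The trade-off is that such internal entry points can change between Julia releases.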

LaurentPlagne · October 31, 2021, 8:43am

If your assumption is correct, should this weaker memory model also affect Julia execution when running in a Linux VM under Parallels?

Elrod · November 1, 2021, 5:53am

Hmm.
I tried running the second example from the issue Mose linked.
It hung when run natively, but not when run on a Linux (AArch64) VM.

The Linux VM shouldn’t be hitting the segfaults either, so seems like that’ll be the way to go.

LaurentPlagne · November 1, 2021, 8:37am

You mean that I should use a Linux VM on Apple silicon until these issues are fixed?

If the Julia implementation (the C++ part), the target architecture of clang, and clang itself are all the same, what is the difference? OS thread management?

But ARM’s weaker memory model should be the same on a Linux AArch64 VM (no Rosetta 2 translation in this case), or am I missing something?

It is surprising that such a small M(N)WE as Darwin/ARM64: Julia freezes on nested `@threads` loops · Issue #41820 · JuliaLang/julia · GitHub exposes a bug that is so difficult to catch. Is clang’s thread sanitizer option (-fsanitize=thread) totally useless in this case? Is the generated machine code very different between native execution and the ARM VM?
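For readers unfamiliar with the pattern named in that issue title, nested `@threads` loops look roughly like the sketch below. This is not the reproducer from issue #41820, only a minimal illustration of the shape of such code; `nested_sum` is a hypothetical function.

```julia
using Base.Threads

# Nested threaded loops: every outer iteration runs its own inner
# `@threads` loop. An atomic counter keeps the accumulation race-free.
# `nested_sum` is an illustrative function, not the code from the issue.
function nested_sum(n::Integer, m::Integer)
    acc = Atomic{Int}(0)
    @threads for i in 1:n
        @threads for j in 1:m
            atomic_add!(acc, i * j)
        end
    end
    return acc[]
end

# Sum over i and j of i*j equals (n(n+1)/2) * (m(m+1)/2).
println(nested_sum(8, 8) == 36 * 36)
```

On a correct runtime this terminates with the closed-form result regardless of thread count, which is what makes a hang in code of this shape so surprising.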

OK, I realize that all these questions are probably irrelevant and may be boring coming from an outsider like me, and that I should first get an M1 machine to catch up on what has been investigated for several months now. Anyway, thank you again @Elrod and @giordano for all your explanations!
