Julia 1.7 on M1 is Incredible?

I recently spent a day fixing performance bottlenecks in my group’s research code (specifically ACE.jl, though that is probably not important). I had some unexpected experiences and am hoping somebody can help me understand them. I work with Julia 1.7 on my M1 MacBook Pro. After I was done optimizing, I also tested on Julia 1.6 on the M1, and on both 1.6 and 1.7 on an EPYC 7702 workstation.

The basic take-away: Julia 1.7 optimizes much better than 1.6 (my code, anyhow), and Julia 1.7 optimizes MUCH MUCH better on the M1 than on the EPYC. (What?!)

  1. Small Surprise: on 1.6 there was an allocation close to some hot loops that dropped performance by several factors; 1.7 eliminated it. Nice that Julia keeps getting more and more clever about optimizing, I thought. But I was a bit surprised about this particular piece of code (see below).
  2. Big Surprise: On the EPYC workstation, the results for BOTH 1.6 and 1.7 were similar to those for 1.6 on the M1. That is, on 1.7 I had the same allocation problem as on 1.6; it occurred on the EPYC but not on the M1. I have no explanation for this whatsoever.
  3. Medium Surprise: I fixed the allocations (again, see below) on all systems and Julia versions. Even after the fix, the code runs about a factor of 3 faster on the M1 than on the EPYC. Moreover, the Julia 1.7 code runs about 10–20% faster than 1.6 on both systems (nice!).
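For anyone wanting to reproduce the comparison, the allocation difference is easy to check with `@allocated` (a minimal sketch; `evaluate!` here is a hypothetical stand-in for the actual hot-loop kernel, not the real ACE.jl API):

```julia
# Hypothetical stand-in for a non-allocating hot-loop kernel.
evaluate!(A, x) = (A .= x .^ 2; A)

A = zeros(1000); x = rand(1000)
evaluate!(A, x)                  # warm up so compilation is not measured
n = @allocated evaluate!(A, x)   # 0 bytes if the fast path truly does not allocate
println("bytes allocated per call: ", n)
```

The fused broadcast `A .= x .^ 2` writes into the pre-allocated `A`, so a non-zero count here points at exactly the kind of spurious allocation described above.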

The last point is actually of practical importance and not just a curiosity. Where do I need to start to fix the performance on the EPYC? Clock speeds are roughly the same, so is it the increased memory bandwidth of the M1 (almost double)? Cache? Something entirely different?

A little more detail:

The specific piece of code that the points above refer to goes something like this (it is a little simplified, but I think the gist is right): it returns a thread-safe pre-allocated temporary array (if available on a stack inside basis.pool) and otherwise allocates one. The intention is that one almost always reuses the pre-allocated arrays, but has a fall-back for convenience during development or testing.

acquire!(basis, T) = hasproperty(basis, :pool) ? acquire!(basis.pool, T) : zeros(T, length(basis))
release!(basis, A) = hasproperty(basis, :pool) ? release!(basis.pool, A) : nothing
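For concreteness, the pool behind basis.pool can be pictured along these lines (a hypothetical sketch with invented names, not the actual ACE.jl implementation): a lock-protected stack of scratch vectors.

```julia
# Hypothetical sketch of a thread-safe array pool; names are invented.
struct VectorPool{T}
    stack::Vector{Vector{T}}
    lock::ReentrantLock
    len::Int
end
VectorPool{T}(len::Int) where {T} = VectorPool{T}(Vector{T}[], ReentrantLock(), len)

# Pop a pre-allocated scratch vector if one is available, otherwise allocate.
acquire!(pool::VectorPool{T}, ::Type{T}) where {T} =
    lock(pool.lock) do
        isempty(pool.stack) ? zeros(T, pool.len) : pop!(pool.stack)
    end

# Return a scratch vector to the pool so later acquire! calls can reuse it.
release!(pool::VectorPool{T}, A::Vector{T}) where {T} =
    lock(() -> push!(pool.stack, A), pool.lock)
```

In steady state every `acquire!` pops an existing vector and every `release!` pushes it back, so the hot path performs no heap allocation at all.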

On all systems except 1.7 on M1, in order for my code not to allocate, I had to replace these functions with @generated functions that “manually” resolved the if hasproperty(basis, :pool) ... branch. The actual code is here just in case: [code], [tests].
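The @generated workaround looks roughly like this (a sketch, not the actual code linked above; note that inside a @generated function the argument names are bound to types, so `hasfield` plays the role of `hasproperty`):

```julia
# Sketch of the @generated workaround: the field check runs at compile time,
# so each specialization contains only one branch and no runtime check.
@generated function acquire!(basis, ::Type{T}) where {T}
    if hasfield(basis, :pool)
        return :( acquire!(basis.pool, T) )
    else
        return :( zeros(T, length(basis)) )
    end
end

@generated function release!(basis, A)
    hasfield(basis, :pool) ? :( release!(basis.pool, A) ) : :( nothing )
end
```

Since the branch is resolved while generating the method body, the compiler never has to prove the `hasproperty` check constant on its own, which is presumably why this version avoided the allocation on every system.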

Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.1.0)
  CPU: Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)

Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC-Rome Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)

@Elrod mentioned that it might be the cache sizes. We saw M1 dominating PDE benchmarks.


Yeah, 1.7 on my M1 has been awesome. Pretty much the only problem (and why I still occasionally use the x86 version of 1.6.*) is some issues with multithreading, but it looks like those are being addressed in 1.7.1.


You are not the first to notice

It even prompted me to look up the price of a new MacBook. I think I can wait 2x as long for my results :)

Thanks for this link and the comments.

Still, my main surprise and puzzle is that 1.7 appears to produce different code on the M1 than on the EPYC, with fewer allocations (possibly due to improved constant propagation, but I’m not certain of this).

A different LLVM? Or does the compiler decide at random how much effort to put into optimising the code?

I have the 2020 MacBook with the original M1 chip, and 1.7.1 is 30%–50% faster than the 2019 8-core Intel iMac I have in my office. This is with OpenBLAS on the M1 and MKL on the Intel.

Eagerly awaiting a vendor BLAS for Apple chips.