Julia 1.7 on M1 is Incredible?

I recently spent a day fixing performance bottlenecks in my group’s research code (specifically ACE.jl, though that is probably not important). I had some unexpected experiences and am hoping somebody can help me understand them. I work with Julia 1.7 on my M1 MacBook Pro. After I was done optimizing, I also tested on Julia 1.6 on the M1, and on both 1.6 and 1.7 on an EPYC 7702 workstation.

The basic take-away: Julia 1.7 optimizes much better than 1.6 (my code, anyhow), and Julia 1.7 optimizes MUCH MUCH better on the M1 than on the EPYC. (What?!)

  1. Small Surprise: on 1.6 there was an allocation close to some hot loops that dropped performance by several factors; 1.7 eliminated it. Nice that Julia keeps getting more and more clever about optimizing, I thought. But I was a bit surprised about this particular piece of code (see below).
  2. Big Surprise: On the EPYC workstation, the results for BOTH 1.6 and 1.7 were similar to those for 1.6 on the M1. That is, on 1.7 I had the same allocation problem as on 1.6; it occurred on the EPYC but not on the M1. I have no explanation for this whatsoever.
  3. Medium Surprise: I fixed the allocations (again, see below) on all systems and Julia versions. Even after the fix, the code runs about a factor of 3 faster on the M1 than on the EPYC. Moreover, the Julia 1.7 code runs about 10–20% faster than 1.6 on both systems (nice!).
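For anyone wanting to reproduce the comparison, the allocation difference is easy to check with `@allocated` (a minimal sketch; `evaluate!` here is a hypothetical stand-in for the actual hot-loop kernel, not the real ACE.jl API):

```julia
# Hypothetical stand-in for a non-allocating hot-loop kernel.
evaluate!(A, x) = (A .= x .^ 2; A)

A = zeros(1000); x = rand(1000)
evaluate!(A, x)                  # warm up so compilation is not measured
n = @allocated evaluate!(A, x)   # 0 bytes if the fast path truly does not allocate
println("bytes allocated per call: ", n)
```

The fused broadcast `A .= x .^ 2` writes into the pre-allocated `A`, so a non-zero count here points at exactly the kind of spurious allocation described above.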

The last point is actually of practical importance and not just a curiosity. Where do I need to start to fix the performance on the EPYC? Clock speeds are roughly the same, so is it the increased memory bandwidth of the M1 (almost double)? Cache? Something entirely different?

A little more detail:

The specific piece of code that the points above refer to goes something like this (it is a little simplified, but I think the gist is right): it returns a thread-safe pre-allocated temporary array (if available on a stack inside basis.pool) and otherwise allocates one. The intention is that one almost always reuses the pre-allocated arrays, but has a fall-back for convenience during development or testing.

acquire!(basis, T) = hasproperty(basis, :pool) ? acquire!(basis.pool, T) : zeros(T, length(basis))
release!(basis, A) = hasproperty(basis, :pool) ? release!(basis.pool, A) : nothing
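For concreteness, the pool behind basis.pool can be pictured along these lines (a hypothetical sketch with invented names, not the actual ACE.jl implementation): a lock-protected stack of scratch vectors.

```julia
# Hypothetical sketch of a thread-safe array pool; names are invented.
struct VectorPool{T}
    stack::Vector{Vector{T}}
    lock::ReentrantLock
    len::Int
end
VectorPool{T}(len::Int) where {T} = VectorPool{T}(Vector{T}[], ReentrantLock(), len)

# Pop a pre-allocated scratch vector if one is available, otherwise allocate.
acquire!(pool::VectorPool{T}, ::Type{T}) where {T} =
    lock(pool.lock) do
        isempty(pool.stack) ? zeros(T, pool.len) : pop!(pool.stack)
    end

# Return a scratch vector to the pool so later acquire! calls can reuse it.
release!(pool::VectorPool{T}, A::Vector{T}) where {T} =
    lock(() -> push!(pool.stack, A), pool.lock)
```

In steady state every `acquire!` pops an existing vector and every `release!` pushes it back, so the hot path performs no heap allocation at all.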

On all systems except 1.7 on M1, in order for my code not to allocate, I had to replace these functions with @generated functions that “manually” resolved the if hasproperty(basis, :pool) ... branch. The actual code is here just in case: [code], [tests].
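The @generated workaround looks roughly like this (a sketch, not the actual code linked above; note that inside a @generated function the argument names are bound to types, so `hasfield` plays the role of `hasproperty`):

```julia
# Sketch of the @generated workaround: the field check runs at compile time,
# so each specialization contains only one branch and no runtime check.
@generated function acquire!(basis, ::Type{T}) where {T}
    if hasfield(basis, :pool)
        return :( acquire!(basis.pool, T) )
    else
        return :( zeros(T, length(basis)) )
    end
end

@generated function release!(basis, A)
    hasfield(basis, :pool) ? :( release!(basis.pool, A) ) : :( nothing )
end
```

Since the branch is resolved while generating the method body, the compiler never has to prove the `hasproperty` check constant on its own, which is presumably why this version avoided the allocation on every system.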

Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.1.0)
  CPU: Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)

Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC-Rome Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)

@Elrod mentioned that it might be the cache sizes. We saw M1 dominating PDE benchmarks.


Yeah, 1.7 on my M1 has been awesome. Pretty much the only problem (and why I still occasionally use the x86 version of 1.6.*) is some issues with multithreading, but it looks like those are being addressed in 1.7.1.


You are not the first to notice

It even prompted me to look up the price of a new MacBook. I think I can wait 2x as long for my results :)

Thanks for this link and the comments.

Still, my main surprise and puzzle is that 1.7 appears to produce different code on the M1 than on the EPYC, with fewer allocations (possibly due to improved constant propagation, but I’m not certain of this).

A different LLVM? Or does the compiler decide at random how much effort to put into optimising the code?

I have the 2020 MacBook with the original M1 chip, and 1.7.1 is 30%–50% faster than the 2019 8-core Intel iMac I have in my office. This is with OpenBLAS on the M1 and MKL on the Intel.

Eagerly awaiting a vendor BLAS for Apple chips.