I recently spent a day fixing performance bottlenecks in my group’s research code (probably not important, but I’m referring to ACE.jl). I had some unexpected experiences and am hoping somebody can help me understand them. I work with Julia 1.7 on my M1 MacBook Pro. After I was done optimizing, I also tested on Julia 1.6 on the M1, and on both 1.6 and 1.7 on an EPYC 7702 workstation.
The basic take-away: Julia 1.7 optimizes much better than 1.6 (for my code anyhow), and Julia 1.7 optimizes MUCH MUCH better on the M1 than on the EPYC. (what?!?)
- Small Surprise: Compared to 1.7, Julia 1.6 introduced an allocation close to some hot loops, and performance dropped by a factor of several. Nice that Julia keeps getting more and more clever about optimizing, I thought. But I was a bit surprised that it was this particular piece of code (see below).
- Big Surprise: On the EPYC workstation the results for BOTH 1.6 and 1.7 were similar to those for 1.6 on the M1. That is, on 1.7 I had the same allocation problem as on 1.6, but only on the EPYC and not on the M1. I have no explanation for this whatsoever.
- Medium Surprise: I fixed the allocations (again, see below) on all systems and Julia versions. Even now, the code on the M1 runs about a factor of 3 faster than on the EPYC. Moreover, Julia 1.7 code runs about 10-20% faster than 1.6 on both systems (nice!).
The last point is actually of practical importance and not just curiosity. Where do I need to start to fix the performance on the EPYC? Clock speeds are roughly the same, so is it the increased memory bandwidth of the M1 (almost double)? Cache? Something entirely different?
A little more detail:
The specific piece of code that the points above refer to goes something like this (it is a little simplified, but I think the gist is right). It returns a thread-safe pre-allocated temporary array (if one is available on a stack inside basis.pool) and otherwise allocates one. The intention is that one almost always reuses the pre-allocated arrays, but has a fall-back for faster development or testing when needed.
acquire!(basis, T) = hasproperty(basis, :pool) ? acquire!(basis.pool, T) : zeros(T, length(basis))
release!(basis, A) = hasproperty(basis, :pool) ? release!(basis.pool, A) : nothing
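A minimal sketch of how one might check whether this pattern allocates inside a hot loop. The types PlainBasis and PooledBasis, the plain Vector-of-Vectors pool, and the function hot_loop are hypothetical stand-ins, not the ACE.jl code, and the pool here is not thread-safe. The number reported will depend on the Julia version and platform, which is exactly the point of the comparison.

```julia
# Hypothetical stand-in types; only acquire!/release! and :pool come from the post.
struct PlainBasis
    n::Int
end

struct PooledBasis
    n::Int
    pool::Vector{Vector{Float64}}   # a simple stack of pre-allocated arrays
end

Base.length(b::Union{PlainBasis, PooledBasis}) = b.n

# Pool-level methods: pop a pre-allocated array, push it back when done.
acquire!(pool::Vector{Vector{T}}, ::Type{T}) where {T} = pop!(pool)
release!(pool::Vector{Vector{T}}, A::Vector{T}) where {T} = (push!(pool, A); nothing)

# Basis-level methods, as in the post.
acquire!(basis, T) = hasproperty(basis, :pool) ? acquire!(basis.pool, T) : zeros(T, length(basis))
release!(basis, A) = hasproperty(basis, :pool) ? release!(basis.pool, A) : nothing

# A toy hot loop that acquires and releases a temporary on every iteration.
function hot_loop(basis, niter)
    s = 0.0
    for _ in 1:niter
        A = acquire!(basis, Float64)
        s += length(A)
        release!(basis, A)
    end
    return s
end

basis = PooledBasis(8, [zeros(8)])
hot_loop(basis, 1)                          # warm up so compilation is not measured
bytes = @allocated hot_loop(basis, 1000)    # bytes allocated by the call
println("bytes allocated: ", bytes)
```

If the hasproperty branch is resolved at compile time, the pooled path should not need to allocate per iteration; a byte count that grows with niter suggests it was not.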
On all systems except 1.7 on M1, in order for my code to not allocate, I had to replace these functions with @generated functions that “manually” resolved the if hasproperty(basis, :pool) ... check. The actual code is here just in case: [code], [tests].
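A minimal sketch of what such @generated replacements could look like, not the actual ACE.jl implementation. It assumes that pool is a real struct field, so the check can be done on the argument type with hasfield (a generated function sees only types, so hasproperty on the value is not available there). VecPool, PooledBasis and PlainBasis are hypothetical, and the pool is not thread-safe.

```julia
# Hypothetical minimal pool: a stack of vectors of fixed length.
struct VecPool{T}
    stack::Vector{Vector{T}}
    len::Int
end
VecPool{T}(len::Integer) where {T} = VecPool{T}(Vector{T}[], len)

# Pool-level methods: reuse an array if one is available, else allocate.
acquire!(pool::VecPool{T}, ::Type{T}) where {T} =
    isempty(pool.stack) ? zeros(T, pool.len) : pop!(pool.stack)
release!(pool::VecPool{T}, A::Vector{T}) where {T} = (push!(pool.stack, A); nothing)

# Hypothetical basis types, one with and one without a pool field.
struct PooledBasis
    n::Int
    pool::VecPool{Float64}
end
struct PlainBasis
    n::Int
end
Base.length(b::Union{PooledBasis, PlainBasis}) = b.n

# Inside the generator, `basis` is the argument *type*, so the branch is
# decided once per type and only the chosen expression is compiled.
@generated function acquire!(basis, ::Type{T}) where {T}
    if hasfield(basis, :pool)
        return :(acquire!(basis.pool, T))
    else
        return :(zeros(T, length(basis)))
    end
end

@generated function release!(basis, A)
    if hasfield(basis, :pool)
        return :(release!(basis.pool, A))
    else
        return :(nothing)
    end
end

pooled = PooledBasis(4, VecPool{Float64}(4))
A = acquire!(pooled, Float64)    # pool is empty, so this allocates once
release!(pooled, A)
B = acquire!(pooled, Float64)    # taken from the pool, no new array
println(B === A)
```

The trade-off is that the field check is tied to the concrete type rather than to getproperty overloads, which is an assumption that may not hold for every basis type.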
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.1.0)
CPU: Apple M1 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD EPYC-Rome Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver2)