I’m looking to speed up Julia on aarch64 and thought I’d write up my attempts so far to see if anyone has any tips. Thanks @staticfloat for the pointers already!
It seems that Julia package loading runs slower than it should on aarch64, even after normalizing for compute power.
As an example, take an NVIDIA Jetson Xavier NX, which has a 6-core NVIDIA Carmel 64-bit ARMv8.2 CPU @ 1400 MHz (6 MB L2 + 4 MB L3) and 8 GB RAM.
Here’s a CPU benchmark of the closely related Jetson AGX Xavier, for reference (source).
On Julia 1.4.1:

```julia
julia> @time using Flux
64.351722 seconds (55.22 M allocations: 2.917 GiB, 2.92% gc time)
```
with precompilation taking 7-8 minutes…
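For reference, one rough way to time the precompile step on its own is `Pkg.precompile()`, which precompiles the dependencies of the active project (a sketch; it assumes Flux is in the active environment):

```julia
julia> using Pkg

julia> @time Pkg.precompile()  # precompiles the project’s dependencies, including Flux
```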
A PackageCompiler.jl Flux sysimage loads with julia in 20 seconds, which is also a lot slower than expected.
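For context, here’s a minimal sketch of how such a sysimage can be built (the output path is illustrative, not the exact one used):

```julia
using PackageCompiler

# Bake Flux and its dependencies into a custom system image
create_sysimage(:Flux; sysimage_path="flux_sysimage.so")
```

Julia is then started with `julia -Jflux_sysimage.so`, so `using Flux` mostly avoids the usual load cost.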
As a benchmark, on a 2018 MacBook Pro (2.6 GHz 6-core Intel Core i7, 16 GB RAM):

```julia
julia> @time using Flux
18.150855 seconds (51.58 M allocations: 2.767 GiB, 4.34% gc time)
```
In looking for ways to speed it up:
- I came across this write-up from ARM, which concludes:

  > As long as you’re not cross compiling, the simplest and easiest way to get the best performance on Arm with both GNU compilers and LLVM-compilers is to use only `-mcpu=native` and actively avoid using `-mtune` or `-march`.
  So I tried building julia 1.4.1 with a `Make.user` of:

  ```make
  USE_BINARYBUILDER_LLVM=0
  CXXFLAGS=-DMCPU=native
  ```
  and got an 11% improvement, which might be above the noise:

  ```julia
  julia> @time using Flux
  56.904661 seconds (54.46 M allocations: 2.882 GiB, 3.13% gc time)
  ```
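  One way to sanity-check what the resulting build actually targets (a sketch; the outputs shown are hypothetical):

  ```julia
  julia> Sys.CPU_NAME          # the CPU name LLVM detected
  "generic"                    # hypothetical; a generic value would suggest no Carmel-specific tuning

  julia> Base.libllvm_version  # the LLVM version this build links against
  v"8.0.1"                     # hypothetical
  ```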
- Additionally, and perhaps most importantly(?), LLVM seems to lack support for the Carmel chipset, though there is an LLVM PR for NVIDIA Carmel support.
- Profiling `using Flux` didn’t highlight any particular bottlenecks (one way to set that profiling up is sketched below).
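For reference, profiling package loading looks roughly like this (a sketch; `using` is a top-level statement, so it has to be wrapped in `eval` to be passed to `@profile`):

```julia
using Profile

# Profile the first load of Flux (run in a fresh session)
@profile eval(:(using Flux))

# Print only frames with a reasonable number of samples
Profile.print(mincount = 100)
```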
What else can I try?