Speeding up julia on aarch64

I’m looking to speed up julia on aarch64 and thought I’d write up my attempts so far, in case anyone has tips. Thanks to @staticfloat for the pointers already.

Package loading in julia seems to run slower than it should on aarch64, even after normalizing for compute power.

As an example, take an Nvidia Jetson Xavier NX, which has a 6-core NVIDIA Carmel 64-bit ARMv8.2 CPU @ 1400MHz* (6MB L2 + 4MB L3) and 8GB RAM.

Here’s a CPU benchmark of the closely similar Jetson AGX Xavier, for reference (source)


On julia 1.4.1

julia> @time using Flux
 64.351722 seconds (55.22 M allocations: 2.917 GiB, 2.92% gc time)

with precompilation taking 7-8 minutes…
A PackageCompiler.jl Flux sysimage loads in 20 seconds, which is still a lot slower than I’d expect.

For comparison, on a 2018 MacBook Pro (2.6 GHz 6-core Intel Core i7 CPU, 16 GB RAM):

julia> @time using Flux
 18.150855 seconds (51.58 M allocations: 2.767 GiB, 4.34% gc time)

In looking for ways to speed it up:

  1. I came across this write up from ARM that concludes:

As long as you’re not cross compiling, the simplest and easiest way to get the best performance on Arm with both GNU compilers and LLVM-compilers is to use only -mcpu=native and actively avoid using -mtune or -march.

So I tried building julia 1.4.1 with a Make.user of


and got an 11% improvement, which might be above the noise

julia> @time using Flux
 56.904661 seconds (54.46 M allocations: 2.882 GiB, 3.13% gc time)
  2. Additionally, and perhaps most importantly(?), LLVM seems to lack support for the Carmel chips, but there is an LLVM PR for Nvidia Carmel support.

  3. Profiling using Flux didn’t highlight any particular bottlenecks.
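
On the LLVM support point, one quick check is to ask Julia which CPU model its bundled LLVM detects on the board. This is only a sketch: `host_target_summary` is a name made up for this post, and reading a "generic" CPU name as "LLVM did not recognise the chip" is an assumption on my part, not documented behaviour.

```julia
# Hypothetical helper (not part of Julia): collect what Julia and LLVM report
# about the host CPU, using only constants that exist in Base and Sys.
function host_target_summary()
    return (
        arch = Sys.ARCH,                      # e.g. :aarch64 or :x86_64
        cpu_name = Sys.CPU_NAME,              # CPU model name as detected by LLVM
        llvm_version = Base.libllvm_version,  # version of the bundled LLVM
    )
end

info = host_target_summary()
println(info)

# Assumption: a "generic" name suggests LLVM fell back to a baseline target.
if info.cpu_name == "generic"
    println("LLVM did not identify a specific CPU model")
end
```

The same CPU name shows up in the `LLVM:` line of `versioninfo()`, so this mostly saves reading through the verbose output.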

What else can I try?


I wonder if it’s related to this bug, which I still experience on aarch64, but not on x86_64:

A few more data points from an old Snapdragon 820 tablet (first set of output below) and a Skylake i7 (second set). Importing Flux is nearly 20x slower on the tablet, while most other operations are only 3-5x slower.

julia> @time using Flux
 72.300242 seconds (44.39 M allocations: 2.320 GiB, 3.44% gc time)

julia> @time using DataFrames
  4.436771 seconds (2.45 M allocations: 114.768 MiB, 2.69% gc time)

julia> @btime sum(rand(10^6))
  3.306 ms (2 allocations: 7.63 MiB)

julia> @btime sum(rand(1:10, 10^6))
  31.906 ms (2 allocations: 7.63 MiB)

julia> versioninfo(verbose=true)
Julia Version 1.5.0-DEV.645
Commit c3fc36707c (2020-04-18 20:13 UTC)
Platform Info:
  OS: Linux (aarch64-unknown-linux-gnu)
  uname: Linux 3.18.71-gbf3ecfa #1 SMP PREEMPT Thu Jul 19 15:46:30 +03 2018 aarch64 unknown
  CPU: unknown:
              speed         user         nice          sys         idle          irq
       #1   307 MHz    2209616 s      23588 s    2480524 s   49892174 s     958221 s
       #2   307 MHz    1944072 s      25940 s    2376852 s   51400837 s     208623 s
       #3  2150 MHz    2148291 s      53418 s    1838297 s   52022257 s     154016 s
       #4  2150 MHz    1368769 s      48719 s    1404845 s   52980727 s     130900 s

  Memory: 3.3130035400390625 GB (1280.02734375 MB free)
  Uptime: 585870.0 sec
  Load Avg:  5.95263671875  6.22265625  5.86669921875
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, kryo)
  HOME = /home/me
  TERM = screen-256color
  MOZ_PLUGIN_PATH = /usr/lib/mozilla/plugins
  PATH = /usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
julia> @time using Flux
  3.708332 seconds (10.07 M allocations: 551.496 MiB, 3.06% gc time)

julia> @time using DataFrames
  0.866196 seconds (2.41 M allocations: 112.536 MiB, 3.22% gc time)

julia> @btime sum(rand(10^6))
  982.687 μs (2 allocations: 7.63 MiB) 

julia> @btime sum(rand(1:10, 10^6))
  10.295 ms (2 allocations: 7.63 MiB)

julia> versioninfo(verbose=true)
Julia Version 1.5.0-DEV.609
Commit 8a55a27ea7 (2020-04-10 01:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  uname: Linux 5.1.5-arch1-2-ARCH #1 SMP PREEMPT Mon May 27 03:37:39 UTC 2019 x86_64 unknown
  CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz:
              speed         user         nice          sys         idle          irq
       #1  4104 MHz   28836168 s       3380 s    7125787 s  2783937058 s     604522 s
       #2  4166 MHz   28995411 s       3485 s    7049353 s  2783573188 s     683347 s
       #3  4100 MHz   13826832 s       3722 s     433086 s  2821434106 s      90100 s
       #4  4099 MHz   13723607 s       3959 s     401989 s  2821654196 s      93240 s
       #5  4103 MHz   14201429 s       3601 s     387999 s  2821305106 s      82883 s
       #6  4149 MHz   14004334 s       4130 s     385735 s  2821400115 s     101692 s
       #7  4100 MHz   13685984 s       3643 s     421938 s  2816115330 s     919846 s
       #8  4194 MHz   13538851 s       4268 s     482884 s  2821210828 s      98332 s

  Memory: 62.701908111572266 GB (15438.20703125 MB free)
  Uptime: 2.8366552e7 sec
  Load Avg:  0.33935546875  0.3837890625  0.4228515625
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
  CMDSTAN_HOME = /home/me/cmdstan-2.19.1
  HOME = /home/me
  TERM = screen-256color
  PATH = /home/me/.cargo/bin:/home/me/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl

Interesting… your i7 on Linux at 3.7 seconds vs. my i7 on macOS at 18.2 seconds alone suggests there’s something going on with Flux here.

On julia master, with aarch64

julia> @time using Flux
 52.568274 seconds (51.52 M allocations: 2.667 GiB, 2.85% gc time)

I also ran the inference analysis outlined in https://github.com/JuliaLang/julia/pull/34613 for using Flux, which gave this ranked list (highest inference effort first): https://gist.github.com/ianshmean/804dda99dc7a648d702879f8e7377fd1

Top 25:

(Tuple{typeof(CuArrays._functional),Bool}, 4260)
(Tuple{typeof(lock),CuArrays.var"#1#2"{Bool},ReentrantLock}, 4259)
(Tuple{CuArrays.var"#1#2"{Bool}}, 4255)
(Tuple{typeof(CuArrays.__configure__),Bool}, 4082)
(Tuple{typeof(lock),CUDAnative.var"#1#2"{Bool},ReentrantLock}, 3783)
(Tuple{CUDAnative.var"#1#2"{Bool}}, 3783)
(Tuple{typeof(CUDAnative.__runtime_init__)}, 3783)
(Tuple{typeof(CUDAnative.release)}, 3783)
(Tuple{typeof(CUDAnative.version)}, 3783)
(Tuple{typeof(CUDAnative.functional),Bool}, 3783)
(Tuple{typeof(CUDAnative._functional),Bool}, 3783)
(Tuple{typeof(CUDAnative.cuda_compat)}, 3783)
(Tuple{typeof(CUDAnative.__configure__),Bool}, 3358)
(Tuple{typeof(CUDAnative.__configure_dependencies__),Bool}, 3342)
(Tuple{typeof(CuArrays.__configure_dependencies__),Bool}, 3239)
(Tuple{typeof(Pkg.Artifacts.do_artifact_str),String,Dict{String,Any},String,Module}, 2609)
(Tuple{CuArrays.BinnedPool.var"#4#5"}, 2607)
(Tuple{Pkg.Artifacts.var"#ensure_artifact_installed##kw",NamedTuple{(:platform,),_A} where _A<:Tuple,typeof(Pkg.Artifacts.ensure_artifact_installed),String,Dict,String}, 2572)
(Tuple{Pkg.Artifacts.var"#ensure_artifact_installed##kw",NamedTuple{(:platform,),Tuple{Pkg.BinaryPlatforms.Linux}},typeof(Pkg.Artifacts.ensure_artifact_installed),String,Dict{String,Any},String}, 2554)
(Tuple{Pkg.Artifacts.var"##ensure_artifact_installed#42",Pkg.BinaryPlatforms.Linux,Bool,Bool,typeof(Pkg.Artifacts.ensure_artifact_installed),String,Dict{String,Any},String}, 2548)
(Tuple{Pkg.Artifacts.var"##ensure_artifact_installed#42",Pkg.BinaryPlatforms.Platform,Bool,Bool,typeof(Pkg.Artifacts.ensure_artifact_installed),String,Dict{String,Any},String}, 2548)
(Tuple{Pkg.Artifacts.var"##ensure_artifact_installed#42",Pkg.BinaryPlatforms.Platform,Bool,Bool,typeof(Pkg.Artifacts.ensure_artifact_installed),String,Dict,String}, 2548)
(Tuple{typeof(CUDAnative.use_local_cuda)}, 2529)
(Tuple{typeof(Pkg.Artifacts.with_show_download_info),Pkg.Artifacts.var"#44#46"{Bool,Bool,Base.SHA1,_A} where _A,String,Bool}, 2515)
(Tuple{typeof(CuArrays.use_local_cuda)}, 2374)
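
To get a feel for which packages dominate a list like this, the entries can be tallied by owning package. The sketch below uses six entries copied from the list above as strings; `owner` is a hypothetical helper that only handles the simple `typeof(Package.func)` form, and it counts entries rather than summing the numbers, since I'm not sure whether the counts of callers already include their callees.

```julia
# A few (signature, count) entries copied from the list above, as strings.
entries = [
    ("Tuple{typeof(CuArrays._functional),Bool}", 4260),
    ("Tuple{typeof(CuArrays.__configure__),Bool}", 4082),
    ("Tuple{typeof(CUDAnative.__runtime_init__)}", 3783),
    ("Tuple{typeof(CUDAnative.version)}", 3783),
    ("Tuple{typeof(Pkg.Artifacts.do_artifact_str),String,Dict{String,Any},String,Module}", 2609),
    ("Tuple{typeof(CUDAnative.use_local_cuda)}", 2529),
]

# Hypothetical helper: extract the top-level package name from `typeof(Pkg.func)`.
function owner(sig::AbstractString)
    m = match(r"typeof\((\w+)\.", sig)
    return m === nothing ? "other" : String(m.captures[1])
end

# Count how many entries belong to each package.
tally = Dict{String,Int}()
for (sig, _) in entries
    pkg = owner(sig)
    tally[pkg] = get(tally, pkg, 0) + 1
end

for (pkg, n) in sort(collect(tally); by = last, rev = true)
    println(rpad(pkg, 12), n)
end
```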

Some testing beyond Flux, and one beyond Julia

Test                                   MacOS     Linux aarch64   How much slower
@time using Flux                       26.74s    52.57s          2.0
@time using ColorTypes                 0.46s     1.34s           2.9
@time using CSV                        0.87s     2.54s           2.9
@time using VideoIO                    3.95s     15.70s          4.0
@btime $x^2                            33.59μs   138.91μs        4.1
echo '2^2^20' | time bc > /dev/null    2.48s     3.54s           1.4

That last test is probably quite basic, but I wanted something non-julia that was easy and worked on mac and linux.

The 4x slowdown on the @btime test does seem particularly large.
As for Flux, it also seems like something is slowing it down on MacOS in particular.
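
For reference, the "How much slower" column is just the aarch64 time divided by the MacOS time. A minimal sketch that recomputes it from the timings in the table (`slowdown` is a name made up here, and the test labels are shortened):

```julia
# (test, MacOS time, aarch64 time) copied from the table above. The units
# cancel in the ratio, so seconds and microseconds can sit side by side.
timings = [
    ("using Flux",       26.74,  52.57),
    ("using ColorTypes",  0.46,   1.34),
    ("using CSV",         0.87,   2.54),
    ("using VideoIO",     3.95,  15.70),
    ("btime x^2",        33.59, 138.91),
    ("bc 2^2^20",         2.48,   3.54),
]

# Slowdown factor = aarch64 time / MacOS time, rounded to one decimal place.
slowdown(mac, arm) = round(arm / mac; digits = 1)

factors = [slowdown(mac, arm) for (_, mac, arm) in timings]

for ((name, _, _), f) in zip(timings, factors)
    println(rpad(name, 18), f, "x")
end
```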

Just a guess, but could the difference between your MacOS and my Linux i7 loading times for Flux be related to the presence of CUDA drivers? I have no GPU, so perhaps it is smart enough to skip some things.

My Mac doesn’t have an Nvidia gpu, so I believe that should be effectively the same as having no gpu.

An 11% speedup sounds good to me!

I had a chance to work on a Cavium ThunderX2 system a year or so back. I did compile Julia with march=native, but sorry to report I did not do much benchmarking.
Something to note about ARM is that there are many variants, as you report.
It can be a bit of a nightmare to negotiate through that.

I agree with you though - if you have a powerful ARM variant then using a native compiled Julia will serve better than the pre-compiled version.
This is not to take away from the rather wonderful finding that you can download Julia directly onto a Raspberry Pi - which means that there is a very low bar to entry for Julia development.
I venture to say that very, very few people would download source and compile for Rasp Pi
(I did this on the Jetson Nano and IIRC it took overnight).

I just remembered… I went to a talk on archspec at FOSDEM.
Archspec aims at providing a standard set of human-understandable labels for various aspects of a system architecture like CPU, network fabrics, etc. and APIs to detect, query and compare them.

Talk is here:

Not by much.

Julia already has all of that.

Not by much.

I’d love to understand a bit more why that’s the case. Is it experience-based or logical?

I’m also unclear whether this LLVM PR https://reviews.llvm.org/D77940 would help speed up mcpu=native further. Is native going to be able to identify the technologies available even if the chip family is missing? Is that PR just for building fast cross-compilation targets?

Native code generation is used for all user code. The default compilation target sets for the sysimg should cover all the important features already, so the only thing missing would be instruction scheduling. I don’t think that usually matters by “much” among top end chips. (I do acknowledge that different ARM vendors will differ more in their performance models compared to x86, though it seems that the server market is somewhat converging back on ARM cores again?) If that matters that much for a particular vendor, you can request it to be added to the compilation target for the binary release.

I’m also unclear whether this LLVM PR https://reviews.llvm.org/D77940 would help speed up mcpu=native further. Is native going to be able to identify the technologies available even if the chip family is missing? Is that PR just for building fast cross-compilation targets?

Does it speed up native? I don’t see how it can without a performance model (as mentioned above). It should make it easier for “cross-compiling” (though I don’t think that’s the word typically used for a different uarch…) but even that is less necessary for julia since we allow more flexible and universal target feature specifications.


Interesting. When you say native is used for user code is it via march=native? If so, according to the arm article above that should be mcpu=native. Perhaps that’s room for improvement?

No. There’s no equivalent of that difference between options in julia.

Ok. Thanks for the explanation.

I’ve opened https://github.com/JuliaLang/julia/pull/35617 and https://github.com/JuliaCI/julia-buildbot/pull/149 to support and switch over to using mcpu instead of march on ARM.

I’m currently building that julia branch locally with USE_BINARYBUILDER_LLVM=0 and MCPU=native on this aarch64 system.
Without the change to Julia’s Makefile, I initially didn’t believe my previous attempts to get MCPU=native into CXXFLAGS were actually effective, and assumed the noise in the timed test just happened to show a ~11% improvement over a few repeats.
On further inspection, the previous method of setting CXXFLAGS seems to have been valid.
I get a very similar result with the new build

julia> @time using Flux
53.3324 seconds (44.29 M allocations: 2.393 GiB, 2.76% gc time)
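
Since a ~10% difference is close to what run-to-run noise could produce, one way to size the noise is to time the load in several fresh Julia processes and look at the spread. A rough sketch: `time_load` is a hypothetical helper, the stdlib `Dates` stands in for Flux, and the measured time includes Julia startup itself. The child processes use whatever environment they start in by default, so the package has to be installed there.

```julia
# Hypothetical helper: time `using <pkg>` in `n` fresh Julia processes, so
# every run starts from a new session instead of reusing loaded code.
function time_load(pkg::AbstractString; n::Integer = 3)
    # Base.julia_cmd() reproduces the current julia binary and its key flags.
    cmd = `$(Base.julia_cmd()) --startup-file=no -e "using $pkg"`
    return [@elapsed(run(cmd)) for _ in 1:n]
end

times = time_load("Dates"; n = 3)   # Dates is a stdlib, used as a stand-in
lo, hi = extrema(times)

println("min = ", round(lo; digits = 2), " s, max = ", round(hi; digits = 2), " s")
println("spread = ", round(100 * (hi - lo) / lo; digits = 1), " %")
```

If the spread between repeats on the same build is comparable to the difference between two builds, the difference probably shouldn't be read as a real speedup.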

Beyond that, I’m not aware of anything else to try

In trying to make this easier to diagnose, I made a package for running a small set of fixed benchmarks to assess performance across systems.

I added a baked-in reference result from my 2018 i7 Macbook pro, which can be compared against. Probably better to compare to a linux x86_64 machine, so I plan to switch it out.

Here it is run on the Xavier NX aarch64 system and compared against the reference MacOS system.

julia> using SystemBenchmark
julia> compareToRef(sysbenchmark())
13×5 DataFrames.DataFrame
│ Row │ cat     │ testname        │ ref_ms      │ res_ms      │ factor   │
│     │ String  │ String          │ Float64     │ Float64     │ Float64  │
│ 1   │ cpu     │ FloatMul        │ 1.61e-6     │ 6.08e-7     │ 0.37764  │
│ 2   │ cpu     │ FloatSin        │ 5.681e-6    │ 8.68342e-6  │ 1.5285   │
│ 3   │ cpu     │ VecMulBroad     │ 4.72799e-5  │ 5.15783e-5  │ 1.09091  │
│ 4   │ cpu     │ MatMul          │ 0.000379541 │ 0.00091201  │ 2.40293  │
│ 5   │ cpu     │ MatMulBroad     │ 0.000165929 │ 0.000199591 │ 1.20287  │
│ 6   │ cpu     │ 3DMulBroad      │ 0.00184215  │ 0.0018017   │ 0.978042 │
│ 7   │ cpu     │ FFMPEGH264Write │ 230.533     │ 616.325     │ 2.67348  │
│ 8   │ mem     │ DeepCopy        │ 0.000207828 │ 0.000339916 │ 1.63556  │
│ 9   │ diskio  │ TempdirWrite    │ 0.196437    │ 0.070913    │ 0.360997 │
│ 10  │ diskio  │ TempdirRead     │ 0.0691485   │ 0.0176      │ 0.254525 │
│ 11  │ loading │ JuliaLoad       │ 282.547     │ 246.116     │ 0.871063 │
│ 12  │ loading │ UsingCSV        │ 1772.47     │ 3065.72     │ 1.72963  │
│ 13  │ loading │ UsingVideoIO    │ 4002.58     │ 15329.0     │ 3.82978  │


  • Generally not too bad
  • Surprisingly faster on diskio
  • Matrix multiplication is slow
  • FFMPEG is slow to encode
  • Loading VideoIO is slow (perhaps an artifact code precompile invalidation issue?)
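
For anyone reading the table without the package at hand, the factor column appears to be res_ms divided by ref_ms, so values above 1 mean the tested system is slower than the reference. The sketch below shows that ratio plus a bare-bones minimum-of-N timer. `min_time_ms` and `factor` are names made up for this post and are not SystemBenchmark.jl's API, and the 100×100 matrix multiply is an arbitrary stand-in workload.

```julia
# Hypothetical helper: best-of-`samples` wall time of `f()` in milliseconds.
function min_time_ms(f; samples::Integer = 20)
    f()   # warm-up call, so compilation time is not measured
    return 1000 * minimum([@elapsed(f()) for _ in 1:samples])
end

# Ratio of the result to the reference; > 1 means slower than the reference.
factor(res_ms, ref_ms) = res_ms / ref_ms

# Check against the MatMul row of the table above.
println(factor(0.00091201, 0.000379541))

# Stand-in workload, not one of the package's actual tests.
x = rand(100, 100)
res_ms = min_time_ms(() -> x * x)
println("100x100 matmul: ", res_ms, " ms")
```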

As for SystemBenchmark.jl, I’d definitely welcome suggestions/PRs on how to improve the tests or informative tests to add. It would be great to arrive at a fixed set of tests that can be relied on.
I also want to add some external non-julia benchmarks, but couldn’t find a nice cross-platform benchmarking lib that could be JLL-ed.

To be reliable in the long-run, this package should also lock down version numbers and capture system data.
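
As a starting point for capturing system data, something like the following could be stored next to each result. It is only a sketch of the idea: `capture_system_info` is a hypothetical name, and the choice of fields is my guess at what would matter when comparing machines.

```julia
# Hypothetical helper: snapshot of the Julia build and host system, meant to
# be saved alongside benchmark results so runs can be compared later.
function capture_system_info()
    return (
        julia_version = VERSION,
        commit = Base.GIT_VERSION_INFO.commit_short,
        os = Sys.KERNEL,                 # e.g. :Linux or :Darwin
        machine = Sys.MACHINE,           # target triple of this build
        cpu_name = Sys.CPU_NAME,         # CPU model as detected by LLVM
        threads = Sys.CPU_THREADS,
        total_memory_gib = Sys.total_memory() / 2^30,
    )
end

sysinfo = capture_system_info()
for (key, value) in pairs(sysinfo)
    println(rpad(String(key), 18), value)
end
```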