macOS ARM64 no faster than emulated x86?

I was curious to test the performance of the newly released Julia 1.7 running natively on my M1 laptop.

Surprisingly, there doesn’t seem to be any speedup:

julia> include("b.jl")
sortperf (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime pisum();
  4.649 ms (0 allocations: 0 bytes)

julia> @btime sortperf(5000);
  250.542 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.1.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)

julia>
julia> include("b.jl")
sortperf (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime pisum();
  4.663 ms (0 allocations: 0 bytes)

julia> @btime sortperf(5000);
  268.416 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.1.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, westmere)

where b.jl is:

function pisum()
  sum = 0.0
  for j = 1:500
    sum = 0.0
    for k = 1:10000
      sum += 1.0/(k*k)
    end
  end
  sum
end

function qsort!(a,lo,hi)
  i, j = lo, hi
  while i < hi
    pivot = a[(lo+hi)>>>1]
    while i <= j
      while a[i] < pivot; i += 1; end
      while a[j] > pivot; j -= 1; end
      if i <= j
        a[i], a[j] = a[j], a[i]
        i, j = i+1, j-1
      end
    end
    if lo < j; qsort!(a,lo,j); end
    lo, j = i, hi
  end
  return a
end

sortperf(n) = qsort!(rand(n), 1, n)

(from Microbenchmarks/perf.jl at master · JuliaLang/Microbenchmarks · GitHub)

What’s going on?

It’s a relatively simple benchmark and the Rosetta translation layer is very good, so the x86 code is translated to efficient ARM code. At that point, the processor is the same.

That’s pretty impressive.

I knew Rosetta isn’t actually emulating x86 and works more like an LLVM backend. But can it be that good?

There’s nothing here that’s particularly hard to translate from x86 to ARM, and the examples are small enough that they’re probably all winding up in the CPU cache, so it’s fast. A better test would be something like multiplying large matrices. Or maybe run the entire MixedModels.jl test suite and compare timings (which is something I would be interested in as one of the maintainers of that package without access to a Mac).
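A rough sketch of what the "multiply large matrices" test could look like, run once in each build and compared. The helper name `matmul_time` and the size `n = 1000` are my own choices for illustration, not something from the thread. Note that dense Float64 multiplication is dispatched to the BLAS library (OpenBLAS by default), so this mostly measures BLAS rather than Julia’s own code generation.

```julia
using LinearAlgebra

# Hypothetical helper: time an n×n Float64 matrix multiply and
# return the best of `reps` runs, in seconds.
function matmul_time(n; reps=3)
    A = rand(n, n)
    B = rand(n, n)
    C = similar(A)
    mul!(C, A, B)  # warm-up call so compilation is not included in the timing
    return minimum(@elapsed(mul!(C, A, B)) for _ in 1:reps)
end

n = 1000                    # assumed size; larger sizes should spill out of cache
t = matmul_time(n)
gflops = 2 * n^3 / t / 1e9  # an n×n multiply costs about 2n^3 floating-point operations
println("n = $n: $(round(t; digits=4)) s, about $(round(gflops; digits=1)) GFLOPS")
```

Running the same file in the native and the x86 builds and comparing the two printed lines should give a more demanding comparison than `pisum` or `sortperf`.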

Unfortunately, MixedModels seems to be broken on ARM64:


jarm % julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.0 (2021-11-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(@v1.7) pkg> activate .
  Activating new project at `~/jarm`

(jarm) pkg> add MixedModels
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed StatsFuns ──────── v0.9.14
   Installed BenchmarkTools ─── v1.2.0
   Installed FillArrays ─────── v0.12.7
   Installed DensityInterface ─ v0.4.0
   Installed SpecialFunctions ─ v1.8.1
   Installed Distributions ──── v0.25.34
    Updating `~/Library/Mobile Documents/com~apple~CloudDocs/work/jarm/Project.toml`
  [ff71e718] + MixedModels v4.5.0
    Updating `~/Library/Mobile Documents/com~apple~CloudDocs/work/jarm/Manifest.toml`
  [69666777] + Arrow v2.2.0
  [31f734f8] + ArrowTypes v1.2.1
  [6e4b80f9] + BenchmarkTools v1.2.0
  [c3b6d118] + BitIntegers v0.2.6
  [fa961155] + CEnum v0.4.1
  [d360d2e6] + ChainRulesCore v1.11.1
  [9e997f8a] + ChangesOfVariables v0.1.1
  [523fee87] + CodecBzip2 v0.7.2
  [5ba52731] + CodecLz4 v0.4.0
  [944b1d66] + CodecZlib v0.7.0
  [6b39b394] + CodecZstd v0.7.2
  [34da2185] + Compat v3.40.0
  [9a962f9c] + DataAPI v1.9.0
  [864edb3b] + DataStructures v0.18.10
  [e2d170a0] + DataValueInterfaces v1.0.0
  [b429d917] + DensityInterface v0.4.0
  [31c24e10] + Distributions v0.25.34
  [ffbed154] + DocStringExtensions v0.8.6
  [e2ba6199] + ExprTools v0.1.6
  [1a297f60] + FillArrays v0.12.7
  [38e38edf] + GLM v1.5.1
  [842dd82b] + InlineStrings v1.0.1
  [3587e190] + InverseFunctions v0.1.2
  [92d709cd] + IrrationalConstants v0.1.1
  [82899510] + IteratorInterfaceExtensions v1.0.0
  [692b3bcd] + JLLWrappers v1.3.0
  [682c06a0] + JSON v0.21.2
  [0f8b85d8] + JSON3 v1.9.2
  [2ab3a3ac] + LogExpFunctions v0.3.5
  [b8f27783] + MathOptInterface v0.10.6
  [fdba3010] + MathProgBase v0.7.8
  [e1d29d7a] + Missings v1.0.2
  [ff71e718] + MixedModels v4.5.0
  [78c3b35d] + Mocking v0.7.3
  [d8a4904e] + MutableArithmetics v0.3.1
  [76087f3c] + NLopt v0.6.4
  [bac558e1] + OrderedCollections v1.4.1
  [90014a1f] + PDMats v0.11.5
  [69de0a69] + Parsers v2.1.2
  [2dfb63ee] + PooledArrays v1.4.0
  [21216c6a] + Preferences v1.2.2
  [92933f4c] + ProgressMeter v1.7.1
  [1fd47b50] + QuadGK v2.4.2
  [3cdcf5f2] + RecipesBase v1.2.1
  [189a3867] + Reexport v1.2.2
  [79098fc4] + Rmath v0.7.0
  [91c51154] + SentinelArrays v1.3.8
  [1277b4bf] + ShiftedArrays v1.0.0
  [a2af1166] + SortingAlgorithms v1.0.1
  [276daf66] + SpecialFunctions v1.8.1
  [90137ffa] + StaticArrays v1.2.13
  [82ae8749] + StatsAPI v1.1.0
  [2913bbd2] + StatsBase v0.33.13
  [4c63d2b9] + StatsFuns v0.9.14
  [3eaba693] + StatsModels v0.6.28
  [856f2bd8] + StructTypes v1.8.1
  [3783bdb8] + TableTraits v1.0.1
  [bd369af6] + Tables v1.6.0
  [f269a46b] + TimeZones v1.6.2
  [3bb67fe8] + TranscodingStreams v0.9.6
  [6e34b625] + Bzip2_jll v1.0.8+0
  [5ced341a] + Lz4_jll v1.9.3+0
  [079eb43e] + NLopt_jll v2.7.0+0
  [efe28fd5] + OpenSpecFun_jll v0.5.5+0
  [f50d1b31] + Rmath_jll v0.3.0+0
  [3161d3a3] + Zstd_jll v1.5.0+0
  [0dad84c5] + ArgTools
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [8bb1440f] + DelimitedFiles
  [8ba89e20] + Distributed
  [f43a241f] + Downloads
  [9fa8497b] + Future
  [b77e0a4c] + InteractiveUtils
  [4af54fe1] + LazyArtifacts
  [b27032c2] + LibCURL
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [a63ad114] + Mmap
  [ca575930] + NetworkOptions
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [9abbd945] + Profile
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [1a1011a3] + SharedArrays
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [4607b0f0] + SuiteSparse
  [fa267f1f] + TOML
  [a4e569a6] + Tar
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll
  [deac9b47] + LibCURL_jll
  [29816b5a] + LibSSH2_jll
  [c8ffd9c3] + MbedTLS_jll
  [14a3606d] + MozillaCACerts_jll
  [4536629a] + OpenBLAS_jll
  [05823500] + OpenLibm_jll
  [83775a58] + Zlib_jll
  [8e850b90] + libblastrampoline_jll
  [8e850ede] + nghttp2_jll
  [3f19e933] + p7zip_jll
Precompiling project...
  ✗ NLopt
  ✗ MixedModels
  16 dependencies successfully precompiled in 26 seconds (55 already precompiled)
  2 dependencies errored. To see a full report either run `import Pkg; Pkg.precompile()` or load the packages

I’m guessing the breakage follows from NLopt not working because NLopt is used extensively inside of MixedModels.

@stevengj can we please have a new release of nlopt that we can build for the new architectures?

This is probably memory bound, and the memory bandwidth under Rosetta is probably very similar to native. There are also some things that are a lot slower on native than on Rosetta, probably because LLVM codegen isn’t as good for aarch64 in that specific case.
Example here: "Very different performance on M1 mac, native vs rosetta"

You might be interested in this post, which uses some more intensive examples, which are also simple to run yourself

Done. build nlopt 2.7.1 by stevengj · Pull Request #4010 · JuliaPackaging/Yggdrasil · GitHub

This is actually how even x86 processors work internally: they translate x86 instructions on the fly into micro-operations, which are what actually get executed. It’s compilers all the way down. :turtle: :turtle: :turtle:

On a slight tangent:

I’m having trouble understanding how to set up and run both Julia (ARM) and Julia (x86) on my machine.
With the ARM version I’m having troubles that I don’t have with the x86 version.

I’m wondering if/how I could call them independently from the terminal, as well as having each installation refer to a different .julia folder when it comes to precompiled packages.

I’ve read about environment variables but I’m not sure how to edit them. Is it something I can do directly in the terminal, or do I have to dig for some config file in the Julia app contents?

I can do this, but it is a bit of a pain. You’ll need two .julia directories, one for each architecture. So if you have

dotJuliaM1 and dotJuliaX86 in your home directory and
Julia-1.7.1.M1.app and Julia-1.7.1.x86.app/ in /Applications

Then from your home directory

ln -s dotJuliaM1 .julia

and from /Applications

ln -s Julia-1.7.1.M1.app Julia-1.7.app

Then run Julia-1.7 and it’ll do what you want. The same applies when switching from M1 to x86: re-point both symlinks at the x86 versions.

And … using MKL also works on M1 with Rosetta 2 and gives you a bit of speedup.

Keep in mind that the ARM version is Tier 3, so you should not be surprised if things don’t work. It is working very well for me right now, but I am ready to go back to X86 if I exercise a bug in the M1 version.

Why? I have a single depot for both architectures, there are no conflicts.

So there are no conflicts between the different architectures in .julia/packages, .julia/artifacts, or .julia/compiled? Where do the different binaries wind up for different architectures?

In the packages directory there is the source code of the packages. In the compiled directory there are precompiled files, which are indexed by a slug which depends, among other things, on the absolute path of the sysimage of the current Julia session and of the Julia executable.

So no, there are going to be no conflicts.
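To see the values that differ between the two installations, something like the following can be run in each session. This is only a sketch for inspection: the helper name `session_info` is made up, and it does not reproduce how the slug itself is computed. It just prints the architecture, the executable directory, the sysimage path, and the first depot.

```julia
# Hypothetical helper: gather the facts that distinguish two Julia
# installations that share one depot.
function session_info()
    return (
        arch = Sys.ARCH,                                        # e.g. :aarch64 or :x86_64
        bindir = Sys.BINDIR,                                    # directory containing the running julia executable
        sysimage = unsafe_string(Base.JLOptions().image_file),  # path of the loaded system image
        depot = isempty(DEPOT_PATH) ? "" : first(DEPOT_PATH),   # first depot on the depot path
    )
end

info = session_info()
for (k, v) in pairs(info)
    println(rpad(string(k), 10), v)
end
```

With a shared depot, `depot` should come out the same in both sessions while `arch`, `bindir`, and `sysimage` differ, which fits the explanation above of why the precompiled files do not collide.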

Thanks. I did not know that. That’ll save me some trouble, starting today.

A slight tangent, but has anyone compared Julia benchmarks on M1 hardware with MacOS Arm64 binaries to Linux Arm64 (either in a VM or with one of the experimental bare metal projects)?