MacOS ARM64 no faster than emulated x86?

dodoplus · December 2, 2021, 6:17pm

I was curious to test the performance of the newly released Julia 1.7 running natively on my M1 laptop.

Surprisingly, there doesn’t seem to be any speedup:

julia> include("b.jl")
sortperf (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime pisum();
  4.649 ms (0 allocations: 0 bytes)

julia> @btime sortperf(5000);
  250.542 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.1.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)

julia>

julia> include("b.jl")
sortperf (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime pisum();
  4.663 ms (0 allocations: 0 bytes)

julia> @btime sortperf(5000);
  268.416 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.1.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, westmere)

where b.jl is:

function pisum()
  sum = 0.0
  for j = 1:500
    sum = 0.0
    for k = 1:10000
      sum += 1.0/(k*k)
    end
  end
  sum
end

function qsort!(a,lo,hi)
  i, j = lo, hi
  while i < hi
    pivot = a[(lo+hi)>>>1]
    while i <= j
      while a[i] < pivot; i += 1; end
      while a[j] > pivot; j -= 1; end
      if i <= j
        a[i], a[j] = a[j], a[i]
        i, j = i+1, j-1
      end
    end
    if lo < j; qsort!(a,lo,j); end
    lo, j = i, hi
  end
  return a
end

sortperf(n) = qsort!(rand(n), 1, n)

(from https://github.com/JuliaLang/Microbenchmarks/blob/master/perf.jl)

What’s going on?

DD

palday · December 2, 2021, 6:29pm

It’s a relatively simple benchmark and the Rosetta translation layer is very good, so the x86 code is translated to efficient ARM code. At that point, the processor is the same.

dodoplus · December 2, 2021, 6:40pm

That’s pretty impressive.

I knew Rosetta isn’t actually emulating x86 and works more like an LLVM backend. But can it be that good?

palday · December 2, 2021, 6:42pm

There’s nothing here that’s particularly hard to translate from x86 to ARM, and the examples are small enough that they’re probably all winding up in the CPU cache, so it’s fast. A better test would be something like multiplying large matrices. Or maybe run the entire MixedModels.jl test suite and compare timings (which is something I would be interested in as one of the maintainers of that package without access to a Mac).
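A rough sketch of what the matrix-multiplication test could look like, to be run once under the native build and once under the x86 build. The function name `matmul_seconds` and the size `n = 2000` are illustrative choices, not part of the original benchmark file. Note that dense `Float64` products are handed to the BLAS library, so this mostly compares the BLAS builds rather than Julia's own code generation.

```julia
using LinearAlgebra

# Time an n×n Float64 matrix product and keep the fastest of several runs.
# `matmul_seconds` is a hypothetical helper written for this comparison.
function matmul_seconds(n; runs = 3)
    A = rand(n, n)
    B = rand(n, n)
    C = similar(A)
    mul!(C, A, B)   # warm-up call so compilation is not included in the timing
    return minimum(@elapsed(mul!(C, A, B)) for _ in 1:runs)
end

n = 2000   # assumed size: three 32 MB matrices, larger than the CPU caches
t = matmul_seconds(n)
gflops = 2 * n^3 / t / 1e9   # a dense product costs about 2n^3 floating-point operations
println("n = $n: ", round(t; digits = 3), " s, ", round(gflops; digits = 1), " GFLOP/s")
```

Comparing the printed times from the two sessions, together with `versioninfo()`, should show whether the gap is larger than in the `pisum` and `sortperf` runs.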

dodoplus · December 2, 2021, 6:54pm

Unfortunately, MixedModels seems to be broken on ARM64:


jarm % julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.0 (2021-11-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(@v1.7) pkg> activate .
  Activating new project at `~/jarm`

(jarm) pkg> add MixedModels
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed StatsFuns ──────── v0.9.14
   Installed BenchmarkTools ─── v1.2.0
   Installed FillArrays ─────── v0.12.7
   Installed DensityInterface ─ v0.4.0
   Installed SpecialFunctions ─ v1.8.1
   Installed Distributions ──── v0.25.34
    Updating `~/Library/Mobile Documents/com~apple~CloudDocs/work/jarm/Project.toml`
  [ff71e718] + MixedModels v4.5.0
    Updating `~/Library/Mobile Documents/com~apple~CloudDocs/work/jarm/Manifest.toml`
  [69666777] + Arrow v2.2.0
  [31f734f8] + ArrowTypes v1.2.1
  [6e4b80f9] + BenchmarkTools v1.2.0
  [c3b6d118] + BitIntegers v0.2.6
  [fa961155] + CEnum v0.4.1
  [d360d2e6] + ChainRulesCore v1.11.1
  [9e997f8a] + ChangesOfVariables v0.1.1
  [523fee87] + CodecBzip2 v0.7.2
  [5ba52731] + CodecLz4 v0.4.0
  [944b1d66] + CodecZlib v0.7.0
  [6b39b394] + CodecZstd v0.7.2
  [34da2185] + Compat v3.40.0
  [9a962f9c] + DataAPI v1.9.0
  [864edb3b] + DataStructures v0.18.10
  [e2d170a0] + DataValueInterfaces v1.0.0
  [b429d917] + DensityInterface v0.4.0
  [31c24e10] + Distributions v0.25.34
  [ffbed154] + DocStringExtensions v0.8.6
  [e2ba6199] + ExprTools v0.1.6
  [1a297f60] + FillArrays v0.12.7
  [38e38edf] + GLM v1.5.1
  [842dd82b] + InlineStrings v1.0.1
  [3587e190] + InverseFunctions v0.1.2
  [92d709cd] + IrrationalConstants v0.1.1
  [82899510] + IteratorInterfaceExtensions v1.0.0
  [692b3bcd] + JLLWrappers v1.3.0
  [682c06a0] + JSON v0.21.2
  [0f8b85d8] + JSON3 v1.9.2
  [2ab3a3ac] + LogExpFunctions v0.3.5
  [b8f27783] + MathOptInterface v0.10.6
  [fdba3010] + MathProgBase v0.7.8
  [e1d29d7a] + Missings v1.0.2
  [ff71e718] + MixedModels v4.5.0
  [78c3b35d] + Mocking v0.7.3
  [d8a4904e] + MutableArithmetics v0.3.1
  [76087f3c] + NLopt v0.6.4
  [bac558e1] + OrderedCollections v1.4.1
  [90014a1f] + PDMats v0.11.5
  [69de0a69] + Parsers v2.1.2
  [2dfb63ee] + PooledArrays v1.4.0
  [21216c6a] + Preferences v1.2.2
  [92933f4c] + ProgressMeter v1.7.1
  [1fd47b50] + QuadGK v2.4.2
  [3cdcf5f2] + RecipesBase v1.2.1
  [189a3867] + Reexport v1.2.2
  [79098fc4] + Rmath v0.7.0
  [91c51154] + SentinelArrays v1.3.8
  [1277b4bf] + ShiftedArrays v1.0.0
  [a2af1166] + SortingAlgorithms v1.0.1
  [276daf66] + SpecialFunctions v1.8.1
  [90137ffa] + StaticArrays v1.2.13
  [82ae8749] + StatsAPI v1.1.0
  [2913bbd2] + StatsBase v0.33.13
  [4c63d2b9] + StatsFuns v0.9.14
  [3eaba693] + StatsModels v0.6.28
  [856f2bd8] + StructTypes v1.8.1
  [3783bdb8] + TableTraits v1.0.1
  [bd369af6] + Tables v1.6.0
  [f269a46b] + TimeZones v1.6.2
  [3bb67fe8] + TranscodingStreams v0.9.6
  [6e34b625] + Bzip2_jll v1.0.8+0
  [5ced341a] + Lz4_jll v1.9.3+0
  [079eb43e] + NLopt_jll v2.7.0+0
  [efe28fd5] + OpenSpecFun_jll v0.5.5+0
  [f50d1b31] + Rmath_jll v0.3.0+0
  [3161d3a3] + Zstd_jll v1.5.0+0
  [0dad84c5] + ArgTools
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [8bb1440f] + DelimitedFiles
  [8ba89e20] + Distributed
  [f43a241f] + Downloads
  [9fa8497b] + Future
  [b77e0a4c] + InteractiveUtils
  [4af54fe1] + LazyArtifacts
  [b27032c2] + LibCURL
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [a63ad114] + Mmap
  [ca575930] + NetworkOptions
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [9abbd945] + Profile
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [1a1011a3] + SharedArrays
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [4607b0f0] + SuiteSparse
  [fa267f1f] + TOML
  [a4e569a6] + Tar
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll
  [deac9b47] + LibCURL_jll
  [29816b5a] + LibSSH2_jll
  [c8ffd9c3] + MbedTLS_jll
  [14a3606d] + MozillaCACerts_jll
  [4536629a] + OpenBLAS_jll
  [05823500] + OpenLibm_jll
  [83775a58] + Zlib_jll
  [8e850b90] + libblastrampoline_jll
  [8e850ede] + nghttp2_jll
  [3f19e933] + p7zip_jll
Precompiling project...
  ✗ NLopt
  ✗ MixedModels
  16 dependencies successfully precompiled in 26 seconds (55 already precompiled)
  2 dependencies errored. To see a full report either run `import Pkg; Pkg.precompile()` or load the packages

palday · December 2, 2021, 7:04pm

I’m guessing the breakage follows from NLopt not working because NLopt is used extensively inside of MixedModels.

giordano · December 2, 2021, 7:21pm

@stevengj can we please have a new release of nlopt that we can build for the new architectures?

gbaraldi · December 2, 2021, 9:04pm

This is probably memory bound, and the memory bandwidth under Rosetta is probably very similar to native. There are also some things that are a lot slower on native than on Rosetta, probably because LLVM codegen isn’t as good for aarch64 in that specific case.
Example here: “Very different performance on M1 mac, native vs rosetta”

lawless-m · December 3, 2021, 2:29pm

You might be interested in this post, which uses some more intensive examples, which are also simple to run yourself

stevengj · December 3, 2021, 6:40pm

Done. https://github.com/JuliaPackaging/Yggdrasil/pull/4010

StefanKarpinski · December 3, 2021, 7:52pm

This is actually internally how even x86 processors work: they translate x86 on the fly to microcode which is then actually executed. It’s compilers all the way down.

danvinci · January 22, 2022, 3:20pm

On a slight tangent:

I’m having trouble understanding how to set up and run both Julia (ARM) and Julia (x86) on my machine.
With the ARM version I’m having troubles that I don’t have with the x86 version.

I’m wondering if and how I could call them independently from the terminal, and have each installation refer to a different .julia folder for precompiled packages.

I’ve read about environment variables but I’m not sure how to edit them. Is it something I can do directly in the terminal, or do I have to dig for some config file in the Julia app contents?

ctkelley · January 22, 2022, 4:16pm

I can do this, but it is a bit of a pain. You’ll need two .julia directories, one for each architecture. So if you have

dotJuliaM1 and dotJuliaX86 in your home directory and
Julia-1.7.1.M1.app and Julia-1.7.1.x86.app/ in /Applications

Then from your home directory

ln -s dotJuliaM1 .julia

and from /Applications

ln -s Julia-1.7.1.M1.app Julia-1.7.app

Then run Julia-1.7 and it’ll do what you want. Same for moving from M1 to X86

And … using MKL also works on M1 with Rosetta2 and gives you a bit of speedup.

Keep in mind that the ARM version is Tier 3, so you should not be surprised if things don’t work. It is working very well for me right now, but I am ready to go back to X86 if I exercise a bug in the M1 version.

giordano · January 22, 2022, 7:05pm

Why? I have a single depot for both architectures, there are no conflicts.

ctkelley · January 22, 2022, 7:40pm

So there are no conflicts with the different architectures in .julia/packages, .julia/artifacts, or .julia/compiled? Where do the different binaries wind up for different architectures?

giordano · January 22, 2022, 7:44pm

In the packages directory there is the source code of the packages. In the compiled directory there are precompiled files, which are indexed by a slug which depends, among other things, on the absolute path of the sysimage of the current Julia session and of the Julia executable:

https://github.com/JuliaLang/julia/blob/1db8b8f160786c0ce23aed1fa865301fb9973329/base/loading.jl#L1464-L1465

    crc = _crc32c(unsafe_string(JLOptions().image_file), crc)
    crc = _crc32c(unsafe_string(JLOptions().julia_bin), crc)

So no, there are going to be no conflicts.
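To see the values involved on your own machine, something like the following could be run in each installation. This is only a sketch: `installation_info` is a name made up for this example, and it reads the same two `JLOptions` fields as the snippet above, plus the architecture and the first depot entry.

```julia
# Collect the values that distinguish one Julia installation from another.
# `installation_info` is a hypothetical helper, not a function from Base.
function installation_info()
    opts = Base.JLOptions()
    return (
        arch      = Sys.ARCH,                         # :aarch64 or :x86_64
        julia_bin = unsafe_string(opts.julia_bin),    # path of the Julia executable
        sysimage  = unsafe_string(opts.image_file),   # path of the system image
        depot     = first(DEPOT_PATH),                # depot shared by both builds
    )
end

info = installation_info()
for (name, value) in pairs(info)
    println(rpad(string(name), 10), value)
end
```

If the two builds are installed in different locations, the executable and sysimage paths should differ even though the depot is the same, which is why the precompiled files do not collide.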

ctkelley · January 22, 2022, 7:52pm

Thanks. I did not know that. That’ll save me some trouble, starting today.

robsmith11 · January 22, 2022, 8:37pm

A slight tangent, but has anyone compared Julia benchmarks on M1 hardware with MacOS Arm64 binaries to Linux Arm64 (either in a VM or with one of the experimental bare metal projects)?
