I was curious to test the performance of the newly released Julia 1.7 running natively on my M1 laptop.
Surprisingly, there doesn’t seem to be any speedup:
julia> include("b.jl")
sortperf (generic function with 1 method)
julia> using BenchmarkTools
julia> @btime pisum();
4.649 ms (0 allocations: 0 bytes)
julia> @btime sortperf(5000);
250.542 μs (2 allocations: 39.11 KiB)
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.1.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)
And the same benchmarks under Rosetta (the x86_64 build):
julia> include("b.jl")
sortperf (generic function with 1 method)
julia> using BenchmarkTools
julia> @btime pisum();
4.663 ms (0 allocations: 0 bytes)
julia> @btime sortperf(5000);
268.416 μs (2 allocations: 39.11 KiB)
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.1.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, westmere)
where b.jl is:
function pisum()
    sum = 0.0
    for j = 1:500
        sum = 0.0
        for k = 1:10000
            sum += 1.0/(k*k)
        end
    end
    sum
end

function qsort!(a, lo, hi)
    i, j = lo, hi
    while i < hi
        pivot = a[(lo+hi)>>>1]
        while i <= j
            while a[i] < pivot; i += 1; end
            while a[j] > pivot; j -= 1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+1, j-1
            end
        end
        if lo < j; qsort!(a, lo, j); end
        lo, j = i, hi
    end
    return a
end

sortperf(n) = qsort!(rand(n), 1, n)
It’s a relatively simple benchmark and the Rosetta translation layer is very good, so the x86 code is translated to efficient ARM code. At that point, the processor is the same.
There’s nothing here that’s particularly hard to translate from x86 to ARM, and the examples are small enough that they probably all wind up in the CPU cache, so they’re fast. A better test would be something like multiplying large matrices. Or maybe run the entire MixedModels.jl test suite and compare timings (which is something I would be interested in, as one of the maintainers of that package without access to a Mac).
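For instance, a large dense matrix multiply mostly exercises the BLAS kernels and blows well past the caches. A minimal sketch (assuming BenchmarkTools is installed; the 2000×2000 size and the thread count are arbitrary choices of mine, not from this thread):

```julia
using LinearAlgebra, BenchmarkTools

# Pin the BLAS thread count so the native and Rosetta runs are comparable.
BLAS.set_num_threads(4)

A = rand(2000, 2000)
B = rand(2000, 2000)
C = similar(A)

# mul! writes into C in place, so the timing reflects the BLAS kernel
# itself rather than allocation.
@btime mul!($C, $A, $B);
```

Running this under both binaries would separate BLAS throughput from the scalar-loop performance measured above.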
This is probably memory bound, and memory bandwidth under Rosetta is probably very similar to native. There are also some things that are a lot slower native than under Rosetta, probably because LLVM codegen isn’t as good for AArch64 in those specific cases.
Example here: “Very different performance on M1 mac, native vs rosetta”
This is actually internally how even x86 processors work: they translate x86 on the fly to microcode which is then actually executed. It’s compilers all the way down.
I’m having trouble understanding how to set up and run both Julia (ARM) and Julia (x86) on my machine.
With the ARM version I’m running into problems that I don’t have with the x86 version.
I’m wondering if/how I could call them independently from the terminal, as well as have each installation refer to a different .julia folder when it comes to precompiled packages.
I’ve read about environment variables but I’m not sure how to edit them. Is that something I can do directly in the terminal, or do I have to dig for some config file in the Julia app contents?
I can do this, but it is a bit of a pain. You’ll need two .julia directories, one for each architecture. So suppose you have
dotJuliaM1 and dotJuliaX86 in your home directory, and
Julia-1.7.1.M1.app and Julia-1.7.1.x86.app in /Applications.
Then from your home directory:
ln -s dotJuliaM1 .julia
and from /Applications:
ln -s Julia-1.7.1.M1.app Julia-1.7.app
Then run Julia-1.7 and it’ll do what you want. Switch both symlinks the same way when moving from M1 to x86.
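Once both apps are installed, a quick way to confirm from inside a session which build and which depot you are actually using (Sys.ARCH and DEPOT_PATH are standard Julia; the values in the comments are what I would expect, not output verified on both builds):

```julia
# Which architecture is this julia binary built for?
@show Sys.ARCH           # expected :aarch64 for the native build, :x86_64 under Rosetta

# Which .julia depot is active? This follows whichever symlink is in place.
@show first(DEPOT_PATH)
```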
And … using MKL also works on M1 under Rosetta 2 and gives you a bit of a speedup.
Keep in mind that the ARM version is Tier 3, so you should not be surprised if things don’t work. It is working very well for me right now, but I am ready to go back to x86 if I hit a bug in the M1 version.
So there are no conflicts between the different architectures in .julia/packages, .julia/artifacts, or .julia/compiled? Where do the binaries for the different architectures wind up?
In the packages directory there is the source code of the packages. In the compiled directory there are the precompiled files, which are indexed by a slug that depends, among other things, on the absolute path of the sysimage of the current Julia session and of the Julia executable.
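A sketch of how to inspect that from the REPL (the compiled/v<major>.<minor> layout is the standard depot layout; the directory contents on your machine will differ):

```julia
# Precompile caches live under <depot>/compiled/v<major>.<minor>/<Package>/,
# with one slugged cache file per distinct Julia configuration.
cachedir = joinpath(first(DEPOT_PATH), "compiled",
                    "v$(VERSION.major).$(VERSION.minor)")
isdir(cachedir) && foreach(println, readdir(cachedir))
```

Since the slug folds in the path of the Julia executable, per the explanation above the native and Rosetta binaries would generate distinct cache files even if they shared a depot.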
A slight tangent, but has anyone compared Julia benchmarks on M1 hardware between the macOS ARM64 binaries and Linux ARM64 (either in a VM or with one of the experimental bare-metal projects)?