JuMP.jl and DifferentialEquations.jl benchmarks on M1 Max with Julia 1.7.0, x86 vs ARM (spoiler: ARM is 1.5-2x faster).

Impressive! And then there’s also this accelerator (see the thread “Apple M1 GPU from Julia?”), reachable through Apple’s Accelerate framework, which can further speed up linear algebra by at least 4x on my machine (see code below).

It would be interesting to see whether Apple Silicon processors start to get used in HPC clusters at some point. Historically, Apple didn’t seem to be interested in this, but who knows.

Presumably there are applications that scale across many cores, rather than depending on single-core speed, where a 40-core Intel(R) Xeon(R) will outperform the 8-core M1 Max, but this is still pretty interesting. And presumably Apple will make a desktop version of this at some point that has more than 8 performance cores.

Exciting time for numerical computing!

julia> using LinearAlgebra, BenchmarkTools, SetBlasInt

julia> A = rand(Float32, 1000, 1000);

julia> @benchmark A*A
BenchmarkTools.Trial: 1191 samples with 1 evaluation.
 Range (min … max):  3.739 ms …  15.245 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.053 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.197 ms ± 709.131 μs  ┊ GC (mean ± σ):  2.30% ± 6.11%

           ▆█▄▁                                                
  ▅█▁▄▁▁▁▁▄████▆▄▃▃▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▄▅▅▄▅▇▆█▇▆▅▆▆▆▄▃▅▆▆ ▇
  3.74 ms      Histogram: log(frequency) by time      5.51 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 2.

julia> BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate")
1705

julia> setblasint(Int32, :sgemm)
1

julia> @benchmark A*A
BenchmarkTools.Trial: 4527 samples with 1 evaluation.
 Range (min … max):  783.959 μs …   6.275 ms  ┊ GC (min … max): 0.00% … 61.15%
 Time  (median):     953.083 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.102 ms ± 501.596 μs  ┊ GC (mean ± σ):  9.91% ± 13.31%

      █▄                                          ▁▁      ▁     ▁
  ▃▁▁███▅▃▄▆▃▁▃▃▃▃▃▁▁▁▃▃▁▃▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▁▃▃▃██▃▄▇█▆██▆▃▄█ █
  784 μs        Histogram: log(frequency) by time       2.92 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 2.
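
As a rough sanity check on the "at least 4x" figure, the two medians above can be turned into a speedup and an approximate throughput. This is only a sketch: it assumes the usual 2n^3 operation count for a dense n×n matrix product, and the variable names are made up for illustration rather than taken from any package.

```julia
# Median timings copied from the two @benchmark runs above, in seconds.
t_default    = 4.053e-3     # before forwarding BLAS calls to Accelerate
t_accelerate = 953.083e-6   # after BLAS.lbt_forward(...) to Accelerate

# Assumption: a dense n×n matrix product costs about 2n^3 floating-point operations.
n = 1000
flops = 2 * n^3

speedup = t_default / t_accelerate
gflops_default    = flops / t_default / 1e9
gflops_accelerate = flops / t_accelerate / 1e9

println("speedup ≈ ", round(speedup, digits = 2), "x")
println("default BLAS ≈ ", round(gflops_default, digits = 0), " GFLOP/s")
println("Accelerate   ≈ ", round(gflops_accelerate, digits = 0), " GFLOP/s")
```

On the medians this works out to a bit over 4x, going from roughly 0.5 to roughly 2 TFLOP/s in Float32. The numbers are specific to this one 1000×1000 case, so other sizes or element types may behave differently.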