JuMP.jl and DifferentialEquations.jl benchmarks on M1 Max, Julia 1.7.0, x86 vs ARM (spoiler: ARM is 1.5-2x faster)

This is just FYI for fun in case people are interested.

I compared JuMP, DiffEq and a simple sort() on an M1 Max MacBook using the official Julia 1.7.0 binaries, running either as native ARM or as x86 under Rosetta2.

The native ARM version seems to be 1.5-2x faster in the JuMP, DiffEq and sort() benchmarks.

ARM Julia 1.7 still has a bunch of bugs (SVD test segfaults on Apple M1 · Issue #41440 · JuliaLang/julia · GitHub and Darwin/ARM64: Julia freezes on nested `@threads` loops · Issue #41820 · JuliaLang/julia · GitHub) and even the processor seems to not be recognized correctly (Feature/CPU Detection for Apple M1 · Issue #40876 · JuliaLang/julia · GitHub) so perhaps it’ll get even better once everything is fixed!

ARM versioninfo()

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.1.0)
  CPU: Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)

x86 versioninfo()

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, westmere)

JuMP code example from GLPK.jl readme.md used for benchmark

using JuMP, GLPK, BenchmarkTools

model = Model(GLPK.Optimizer)
@variable(model, 0 <= x <= 2.5, Int)
@variable(model, 0 <= y <= 2.5, Int)
@objective(model, Max, y)
reasons = UInt8[]
function my_callback_function(cb_data)
    reason = GLPK.glp_ios_reason(cb_data.tree)
    push!(reasons, reason)
    if reason != GLPK.GLP_IROWGEN
        return
    end
    x_val = callback_value(cb_data, x)
    y_val = callback_value(cb_data, y)
    if y_val - x_val > 1 + 1e-6
        con = @build_constraint(y - x <= 1)
        MOI.submit(model, MOI.LazyConstraint(cb_data), con)
    elseif y_val + x_val > 3 + 1e-6
        con = @build_constraint(y + x <= 3)
        MOI.submit(model, MOI.LazyConstraint(cb_data), con)
    end
end
MOI.set(model, GLPK.CallbackFunction(), my_callback_function)
@benchmark optimize!(model)

JuMP ARM (Native):


JuMP x86 (Rosetta2):

DiffEq code from DiffEq manual:

using DifferentialEquations, BenchmarkTools
f(u,p,t) = 1.01*u
u0 = 1/2
tspan = (0.0,1.0)
prob = ODEProblem(f,u0,tspan)
@benchmark DifferentialEquations.solve(prob, Tsit5(), reltol=1e-8, abstol=1e-8)

DiffEq ARM (Native):


DiffEq x86 (Rosetta2):

Code and benchmark of sort()
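The post does not show the input that was sorted, so the following is only a minimal sketch of what such a benchmark might look like. It assumes a 1000-element Float64 vector named `x`, which would be consistent with the `@benchmark sort($x)` call and the single 7.94 KiB allocation reported later in the thread; the seed and the vector length are assumptions, not the author's actual setup.

```julia
using BenchmarkTools, Random

Random.seed!(1)    # assumption: fixed seed so repeated runs sort the same data
x = rand(1000)     # assumption: 1000 Float64 values

# `$x` interpolates the vector so that global-variable lookup is not timed.
# `sort` (unlike `sort!`) returns a new vector, so `x` stays unsorted
# and every sample does the same amount of work.
@benchmark sort($x)
```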
sort() ARM (Native):


sort() x86 (Rosetta2):


Super interesting!

Both the DiffEq and the sort benchmarks are probably not multithreaded, right? I’d be very curious how these numbers stack up against whatever Intel processor currently has the best single-core speed.

Yes, I started Julia with 1 thread so that I don’t have to worry about multithreading. So I presume everything is just running on one thread.
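As a quick sanity check, the thread settings can be inspected from inside the session. This is a minimal sketch using only the standard library; note that Julia threads and BLAS threads are configured separately, so starting Julia with one thread does not by itself limit BLAS to one thread (this should matter little for the JuMP, DiffEq and sort() benchmarks, which are assumed here not to lean on BLAS).

```julia
using LinearAlgebra

# Julia-level threads: set with `julia -t N` or JULIA_NUM_THREADS.
println("Julia threads: ", Threads.nthreads())

# BLAS threads: set independently, e.g. with BLAS.set_num_threads(n).
println("BLAS threads:  ", BLAS.get_num_threads())
```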

It would be really fun to compare to other processors. The only thing I can say here is that my 2018 iMac with a 3.5 GHz Intel Core i5 gets 15 ms on the JuMP program, which is 3x slower than the MacBook M1 Max. But of course the 2018 Intel Core i5 was not top of the line even in 2018.

If anyone has a fancier Intel or other processor, it would be fun to hear what benchmarks they get on these three simple problems!


48Gb RAM

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, haswell)
Environment:
  JULIA_CPU_THREADS = 40
  JULIA_NUM_THREADS = 40

julia -t 1 -p 1

julia> @benchmark optimize!(model)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.308 μs … 215.231 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.101 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.584 μs ±   2.460 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▁▆██▄▂
  ▂▂▃▅██████▇▅▄▃▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▂▁▂▂▂▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  13.3 μs         Histogram: frequency by time         19.5 μs <

 Memory estimate: 2.11 KiB, allocs estimate: 56.

julia> @benchmark DifferentialEquations.solve(prob, Tsit5(), reltol=1e-8, abstol=1e-8)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.238 μs …  3.296 ms  ┊ GC (min … max):  0.00% … 99.69%
 Time  (median):     7.619 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   8.079 μs ± 48.457 μs  ┊ GC (mean ± σ):  10.27% ±  1.73%

      ▁▂▁▂▂▁▁▃▄▅▆▄▁               ▁▄▅███▇▆▄▃▁
  ▁▂▄██████████████▆▃▂▂▂▁▁▁▂▂▂▃▅▅▆████████████▇▆▅▅▄▄▄▄▃▃▂▂▂▂ ▄
  5.24 μs        Histogram: frequency by time        9.42 μs <

 Memory estimate: 6.11 KiB, allocs estimate: 38.

julia> @benchmark sort($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.556 μs …  1.347 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     21.782 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.138 μs ± 13.336 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▃▃▃▂▁       ▁▄▆▇██▇▆▄▄▄▅▅▄▃▂▁                           ▂
  ▃▆███████▇▅▄▅▅▅▆█████████████████████▇▇▅▇██▆▆▆▃▅▆▅▆▅▇▇█▇█▇▆ █
  18.6 μs      Histogram: log(frequency) by time      27.4 μs <

 Memory estimate: 7.94 KiB, allocs estimate: 1.

My 2 x 20 Cores weep at your single threaded order of magnitude advantage :slight_smile: (waits for the HPC guys to turn up)


Oh cool, thanks @lawless-m!

How about some simple matrix multiplication using all threads?

Below is the simplest example I could think of.

I also tried different numbers of BLAS threads, which you can set using BLAS.set_num_threads(). The M1 Max has 8 performance and 2 efficiency cores, so predictably performance actually decreases when I go from 8 to 10 threads.

julia> using LinearAlgebra, BenchmarkTools

julia> A = rand(1000,1000); B = rand(1000,1000);

julia> @benchmark $A * $B
BenchmarkTools.Trial: 665 samples with 1 evaluation.
 Range (min … max):  7.155 ms …  19.538 ms  ┊ GC (min … max): 0.00% … 6.96%
 Time  (median):     7.252 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.518 ms ± 734.446 μs  ┊ GC (mean ± σ):  2.42% ± 4.87%

  ▃█▇▃▂                   ▂▂                                   
  █████▆▅▅▅▆▅▁▄▁▁▄▁▄▄▄▁▁████▇▇▆▆▆▁▁▅▅▅▆▄▅▁▁▄▁▅▁▁▁▁▁▁▁▁▄▅▁▁▄▁▅ ▇
  7.16 ms      Histogram: log(frequency) by time      9.84 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

julia> BLAS.set_num_threads(1)

julia> @benchmark $A * $B
BenchmarkTools.Trial: 119 samples with 1 evaluation.
 Range (min … max):  42.017 ms …  43.615 ms  ┊ GC (min … max): 0.00% … 3.18%
 Time  (median):     42.179 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.364 ms ± 440.317 μs  ┊ GC (mean ± σ):  0.44% ± 1.00%

    ▅▅▄█▄▁                                                      
  ▃▆███████▆▅▃▃▃▄▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▅▄▁▃▃▄▃▁▁▄▁▅▁▁▁▃ ▃
  42 ms           Histogram: frequency by time         43.6 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

julia> BLAS.set_num_threads(2)

julia> @benchmark $A * $B
BenchmarkTools.Trial: 220 samples with 1 evaluation.
 Range (min … max):  22.335 ms …  24.428 ms  ┊ GC (min … max): 0.00% … 7.45%
 Time  (median):     22.536 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.726 ms ± 463.375 μs  ┊ GC (mean ± σ):  0.85% ± 1.87%

     ▃▆█▇▆▃                                                     
  █▁▆██████▇█▆▅▅▁▅▁▅▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▅▆▁██▇█▁█▅▅▆▅▁▅▅▅▁▅▁▁▁▅ ▆
  22.3 ms       Histogram: log(frequency) by time      24.2 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

julia> BLAS.set_num_threads(4)

julia> @benchmark $A * $B
BenchmarkTools.Trial: 392 samples with 1 evaluation.
 Range (min … max):  12.271 ms …  14.460 ms  ┊ GC (min … max): 0.00% … 13.15%
 Time  (median):     12.555 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.760 ms ± 501.337 μs  ┊ GC (mean ± σ):  1.67% ±  3.49%

        ▅▇██▅                                                   
  ▄▁▁▇████████▄▅▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▅█▆▅▆▆▇██▁█▇▆▇▆▄▁▆▅▅ ▆
  12.3 ms       Histogram: log(frequency) by time      14.2 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

julia> BLAS.set_num_threads(8)

julia> @benchmark $A * $B
BenchmarkTools.Trial: 666 samples with 1 evaluation.
 Range (min … max):  7.178 ms …  23.716 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.270 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.509 ms ± 800.333 μs  ┊ GC (mean ± σ):  2.63% ± 5.27%

  ▂██▆▃                          ▁ ▂                           
  ██████▄▆▆▄▄▁▁▁▁▁▁▁▁▅▁▁▁▄▄▄▁▁▄▁▆█████▆▇█▇▆▆▆▅▁▆▄▄▄▄▄▁▁▁▄▁▁▁▄ ▇
  7.18 ms      Histogram: log(frequency) by time       9.2 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

julia> BLAS.set_num_threads(10)

julia> @benchmark $A * $B
BenchmarkTools.Trial: 274 samples with 1 evaluation.
 Range (min … max):  15.185 ms … 47.571 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.907 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.285 ms ±  2.479 ms  ┊ GC (mean ± σ):  1.26% ± 2.71%

           ▁▆▂▁▁▅▁█  ▂█▁▆▃  ▃ ▂ ▁                              
  ▄▄▁▆▁▅▇█▆█████████▇█████▆▆█▆█▆█▅██▅▅▅▄▁▅▇▁▄▇▄▅▆▃▁▃▁▁▄▃▃▁▃▁▃ ▄
  15.2 ms         Histogram: frequency by time        22.9 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.
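The manual sweep above can also be scripted. The following is a rough sketch using only the standard library: it takes the best of a few `@elapsed` timings instead of using BenchmarkTools, so the numbers will be noisier than the `@benchmark` output. The helper name `best_time!`, the repetition count and the list of thread counts are illustrative choices, not something from the original runs.

```julia
using LinearAlgebra

A = rand(1000, 1000); B = rand(1000, 1000)
C = similar(A)

# Best-of-`reps` wall-clock time for an in-place multiply,
# so no output matrix is allocated inside the timed call.
function best_time!(C, A, B; reps = 5)
    mul!(C, A, B)                      # warm-up run (includes compilation)
    best = Inf
    for _ in 1:reps
        t = @elapsed mul!(C, A, B)
        best = min(best, t)
    end
    return best
end

original = BLAS.get_num_threads()
for n in (1, 2, 4, 8)                  # assumption: thread counts to try
    BLAS.set_num_threads(n)
    t = best_time!(C, A, B)
    println(n, " BLAS thread(s): ", round(t * 1e3, digits = 2), " ms")
end
BLAS.set_num_threads(original)         # restore the previous setting
```

Using `mul!` with a preallocated `C` keeps the 7.63 MiB output allocation (and any resulting GC time) out of the measurement, which is a slightly different setup from `$A * $B`.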

Not as impressive as either of us had hoped - unless you’re rooting for the M1 :slight_smile: !


julia> BLAS.get_num_threads()
40

julia> @benchmark $A * $B
BenchmarkTools.Trial: 391 samples with 1 evaluation.
 Range (min … max):  10.426 ms … 20.151 ms  ┊ GC (min … max): 0.00% … 20.71%
 Time  (median):     10.741 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.798 ms ±  2.620 ms  ┊ GC (mean ± σ):  2.52% ±  6.03%

   ▂█                                                          
  ▅██▄▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▄▄▂▁▁▂▂▁▁▂▁▁▁▁▁▁▂▃▇▇▂▁▁▁▃▄▃▂ ▂
  10.4 ms         Histogram: frequency by time          17 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2. 

julia> BLAS.set_num_threads(1)

julia> @benchmark $A * $B
BenchmarkTools.Trial: 103 samples with 1 evaluation.
 Range (min … max):  47.310 ms … 51.377 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.462 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.540 ms ±  1.479 ms  ┊ GC (mean ± σ):  0.49% ± 1.27%

  ▅█▅                                          ▅▄          ▂   
  ███▅▆▅▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▅▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▅▁▁▁▁▁▁▁▅█▆ ▅
  47.3 ms      Histogram: log(frequency) by time      51.1 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

Impressive! And then there’s also this accelerator (Apple M1 GPU from Julia?) that can further speed up linear algebra at least 4x on my machine (see code below).

It would be interesting to see if Apple Silicon processors start to get used in HPC clusters at some point. Historically, Apple didn’t seem to be interested in this, but who knows.

Presumably, there are applications where you can use multicore computation instead of single-core or multithreaded code, and where a 40-core Intel(R) Xeon(R) will outperform the 8-core M1 Max, but this is still pretty interesting. And presumably Apple will make a desktop version of this at some point that will have more than 8 performance cores.

Exciting time for numerical computing!

using LinearAlgebra, BenchmarkTools, SetBlasInt

julia> A = rand(Float32, 1000, 1000);

julia> @benchmark A*A
BenchmarkTools.Trial: 1191 samples with 1 evaluation.
 Range (min … max):  3.739 ms …  15.245 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.053 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.197 ms ± 709.131 μs  ┊ GC (mean ± σ):  2.30% ± 6.11%

           ▆█▄▁                                                
  ▅█▁▄▁▁▁▁▄████▆▄▃▃▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▄▅▅▄▅▇▆█▇▆▅▆▆▆▄▃▅▆▆ ▇
  3.74 ms      Histogram: log(frequency) by time      5.51 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 2.

julia> BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate")
1705

julia> setblasint(Int32, :sgemm)
1

julia> @benchmark A*A
BenchmarkTools.Trial: 4527 samples with 1 evaluation.
 Range (min … max):  783.959 μs …   6.275 ms  ┊ GC (min … max): 0.00% … 61.15%
 Time  (median):     953.083 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.102 ms ± 501.596 μs  ┊ GC (mean ± σ):  9.91% ± 13.31%

      █▄                                          ▁▁      ▁     ▁
  ▃▁▁███▅▃▄▆▃▁▃▃▃▃▃▁▁▁▃▃▁▃▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▁▃▃▃██▃▄▇█▆██▆▃▄█ █
  784 μs        Histogram: log(frequency) by time       2.92 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 2.
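To confirm which backend is actually serving BLAS calls after a `BLAS.lbt_forward` call, the libblastrampoline configuration can be listed. This is a small sketch that assumes Julia 1.7 or later, where `BLAS.get_config()` is available; on a default install it would be expected to show OpenBLAS, and the Accelerate path only after forwarding as in the session above.

```julia
using LinearAlgebra

# List the BLAS/LAPACK libraries currently loaded through libblastrampoline,
# together with the integer interface (:lp64 or :ilp64) each one exposes.
for lib in BLAS.get_config().loaded_libs
    println(lib.libname, " (", lib.interface, ")")
end
```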


The weak point of the M1 for HPC is still going to be memory. Modern HPC will usually want 128 GB at a bare minimum, and 1 TB isn’t uncommon.

I am definitely out of my depth here, but I would imagine that depends on the application, as some calculations need a lot of memory and some don’t. On my university cluster (System Overview - Research IT) most nodes still have 64GB and only a minority of nodes have high memory like 128GB, 384GB or 1.5TB.

In my very amateur opinion, it seems like Apple probably could make processors that have lots of memory and are good for HPC, but it is not clear that they would want to invest resources into this given that they make most of their money from iPhones, iPads and watches (• Apple revenue breakdown by product | Statista). Even Macs are “only” 10% of revenue. And they probably want their chip designers to focus on making small, efficient chips for iPhones, iPads and laptops, which presumably have different constraints compared to chips for HPC, which have fewer space/efficiency limitations. But once again, I really know nothing about the nuances of this area, so all I said above might be nonsense.

I’ve got a 128 core/256 thread AMD system with the max # of memory banks. I’ll post some BLAS numbers when I’m in the office.
The execution time of that 1000x1000 problem might not be large enough to get an accurate scaling number on that # of cores though.
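To put the problem size in perspective, here is a back-of-the-envelope sketch. It uses the usual approximation of 2n³ floating-point operations for an n×n matrix product and plugs in median times already reported in this thread; the helper names `gemm_flops` and `gflops` are made up for illustration.

```julia
# Approximate floating-point operation count for an n×n by n×n product.
gemm_flops(n) = 2 * n^3

# Throughput in GFLOP/s for a run that took `seconds`.
gflops(n, seconds) = gemm_flops(n) / seconds / 1e9

# Median times for n = 1000 (Float64) as reported earlier in the thread.
println("M1 Max, 8 BLAS threads: ", round(gflops(1000, 7.270e-3), digits = 1), " GFLOP/s")
println("M1 Max, 1 BLAS thread:  ", round(gflops(1000, 42.179e-3), digits = 1), " GFLOP/s")
println("Xeon, 40 BLAS threads:  ", round(gflops(1000, 10.741e-3), digits = 1), " GFLOP/s")
```

At n = 1000 the whole product is only about 2×10⁹ operations and finishes in a few milliseconds, so on a machine with very many cores thread start-up and synchronization overhead may take a noticeable share of the run; a larger n would likely give a fairer scaling picture.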


That’s true, but that’s only because most of their nodes are from 2014 and 2015. The main config they bought in 2018 was 96GB, and I’d be surprised if the next set they buy has less than 128gb. It also might be a good bit more since the nodes they bought in 2018 were only 32 cores, and their next purchase will probably be between 64 and 256 cores, so even more memory wouldn’t be a bad idea to be able to feed them.

Yes, that sounds right

On my Dell Precision 7560 Windows 10 mobile workstation, I get the following:

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 1

In other words, comparable to @DenisTitov’s results on Rosetta2 for JuMP, otherwise somewhere in between (I also started Julia with 1 thread).

For the BLAS threads test, starting Julia with all 16 available threads, I get the following:

There is no difference between performance and efficiency cores on my system, so I didn’t try any further combinations. Here, it is clear that my performance is consistently a factor of 3 behind @DenisTitov’s result on the M1 Max.

Finally, I get:

for which of course I couldn’t use any accelerator. This is 15x slower!
