Very different performance on M1 Mac, native vs Rosetta

I was trying the argmin vs minimum benchmark that was being discussed and found that native ARM Julia is almost 5x slower than running under Rosetta. Is the ARM codegen really that much worse, or is this a bug specific to the M1? The native version is running on a fork of Julia with correct feature detection for the M1.
Benchmarks:

Native

julia> y = rand(100_000)
100000-element Vector{Float64}:
julia> @benchmark findmin($y)
BenchmarkTools.Trial: 9705 samples with 1 evaluation.
 Range (min … max):  505.000 μs … 627.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     511.125 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   513.248 μs ±   8.133 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▇▁▁▁▅▃██▅▅▅▆▃▃▂▁▁▂ ▁▁   ▁▁      ▁                            ▂
  ███████████████████████████████████▇▇▇▇▇▇▆▆▆▅▆▆▄▆▆▅▅▇▆▆▆▇▅▅▇▆ █
  505 μs        Histogram: log(frequency) by time        548 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark argmin($y)
BenchmarkTools.Trial: 9767 samples with 1 evaluation.
 Range (min … max):  504.916 μs … 595.792 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     506.458 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   510.306 μs ±   7.815 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█▁▂▃▆▁▁▅▂▁▁▃▂▁▃   ▁                                         ▁
  █████████████████▇▇▇█▇▇▇▇▇▆▇█▇▇▇▇▇▇▆▇▇▇▇▆▇▆▆▆▆▅▅▆▄▆▆▅▅▅▆▆▄▅▆▅ █
  505 μs        Histogram: log(frequency) by time        545 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.8.0-DEV.360
Commit c70db599c2* (2021-08-17 10:32 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.6.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, apple-a14)
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_NUM_PRECOMPILE_TASKS = 4

Rosetta

julia> @benchmark findmin($y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  144.583 μs … 199.625 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     144.875 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   145.727 μs ±   2.973 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█          ▂▂         ▁                                      ▁
  ██▆▇▇▇▇▇▄▅▅▄██▆▅▃▆▅▆▆▅▅███▆▆▅▆▄▆▆▅▅▅▅▆▆▅▅▆▆▆▆▆▆▇▅▆▅▅▅▂▄▃▄▅▄▃▃ █
  145 μs        Histogram: log(frequency) by time        159 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark argmin($y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  144.375 μs … 434.167 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     145.270 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   148.420 μs ±   7.243 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▁▁▁▁▂▄▁ ▁▂▅▂▂▁▁▂▂▁▁▁▁▂▃▃▂▁                                  ▁
  ████████████████████████████▇▇▆▆▆▅▆▅▄▅▃▅▆▅▅▅▅▅▂▅▄▅▅▄▃▃▄▄▃▄▄▄▄ █
  144 μs        Histogram: log(frequency) by time        170 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Can you re-check with the Julia “Nightly” builds?
(A newer LLVM is better at supporting new CPUs.)

Or just add your Julia version info.

Added, thanks

FWIW, the M1 is treated the same as an Apple A7.

Not relevant here, but this can cause problems in some code due to not unrolling enough (this code isn’t unrolled anyway).

Yeah, that’s why I ran on my fork with hard-coded feature detection. It’s running as an A14, since the M1 target is only available in LLVM 13.

Does it still use the Cyclone scheduling model, or has that been updated too?
I thought I read that Apple hadn’t contributed a new model since then.

I dunno. I was searching for the features and came across this: https://github.com/llvm/llvm-project/blob/82507f1798768280cf5d5aab95caaafbc7fe6f47/llvm/include/llvm/Support/AArch64TargetParser.def , which is what I used for the features. But I don’t know if the scheduling model was also updated.

edit:
I checked LLVM, and the A14 still uses the Cyclone scheduling model :frowning:

Yeah, and you can check simple examples like sum and see that it only unrolls 2x. For every x86 CPU I’ve seen, it unrolls 4x.
The M1 benefits even more from unrolling in this example than any x86 CPU.
(Unrolling breaks up the dependency chain of the sum; more unrolling → more out-of-order opportunity, and the M1 is currently the best CPU in this respect AFAIK [definitely better than any x86].)
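
To make that concrete, here is a hand-rolled sketch of what unrolling buys (an illustration only, not what Julia’s sum actually emits):

julia> function sum4(x::Vector{Float64})
           # Four independent accumulators break the serial dependency
           # chain of a naive sum, so the adds can execute out of order.
           # (This reassociates the floating-point additions, which is
           # why the compiler only does it under @simd / fast-math.)
           s1 = s2 = s3 = s4 = 0.0
           i = 1
           @inbounds while i + 3 <= length(x)
               s1 += x[i]
               s2 += x[i + 1]
               s3 += x[i + 2]
               s4 += x[i + 3]
               i += 4
           end
           s = s1 + s2 + s3 + s4
           @inbounds while i <= length(x)  # remainder loop
               s += x[i]
               i += 1
           end
           return s
       end
sum4 (generic function with 1 method)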

Why hasn’t Apple added a new model? Wouldn’t Swift benefit from it too? Or does Swift use Apple’s own LLVM fork?

I keep asking the same thing. AMD and Intel do contribute models, but they’re incomplete and well behind the models that some folks, like the authors of uiCA, painstakingly reverse engineered.
I think they’d benefit.

But really, it’d help Julia most of all.
Most software is compiled for generic targets (although I think it’s common to set mtune= to something more recent).
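
For reference, here is how to check which CPU LLVM detected and how to pick an explicit target at startup (these are the standard Julia mechanisms, not anything specific to this thread; the apple-a14 value is just an example):

julia> Sys.CPU_NAME  # the CPU name LLVM detected for this session

# at startup, select/limit the target explicitly:
$ julia --cpu-target=apple-a14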

Does LLVM use the “official” schedulers, or do they optimize them further using measured data?
And it’s not like Apple doesn’t have an LLVM scheduling model written; otherwise they’d be leaving a lot of performance on the table. Maybe they keep it private to hide opcodes or something like that.

After much messing around I think I might have found an MWE. I don’t know where the difference is, but it’s related to the isgreater function.

Setup

julia> a = pairs(rand(100_000))

julia> function test(itr, op)
           y = iterate(itr)
           y === nothing && return nothing  # was `return init`; `init` is undefined here
           v = y[1]
           while true
               y = iterate(itr, y[2])
               y === nothing && break
               v = op(v, y[1])
           end
           return v
       end
test (generic function with 1 method)

julia> function withsymbol((fm, im), (ix, fx))
           fm > fx ? (fx, ix) : (fm, im)
       end
withsymbol (generic function with 1 method)

julia> function withgreater((fm, im), (ix, fx))
           Base.isgreater(fm, fx) ? (fx, ix) : (fm, im)
       end
withgreater (generic function with 1 method)
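
For context on why these two can compile so differently: a bare fm > fx is a single floating-point compare, while Base.isgreater (an internal function on this Julia version) is defined via isless with a total order that sorts NaN and missing below all regular values, so it carries extra branches. For example:

julia> 5.0 > NaN                 # ordinary float compare: false for NaN
false

julia> Base.isgreater(5.0, NaN)  # total order: NaN treated as smallest
true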

Native

julia> @benchmark test(a,withsymbol)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  136.916 μs … 821.875 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     137.083 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   137.795 μs ±   7.250 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▁             ▄                                         ▁   ▁
  ███▆▅▅▇▆▄▃▅▅▄▁▄▇██▆▅▅▅▅█▇▆▅▅▇▇▅▅▆▅▅▄▅▆▅▅▅▅▅▅▅▃▅▅▄▄▄▄▁▄▄▅▅▆█▆▆ █
  137 μs        Histogram: log(frequency) by time        147 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

julia> @benchmark test(a,withgreater)
BenchmarkTools.Trial: 9486 samples with 1 evaluation.
 Range (min … max):  521.542 μs … 656.209 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     521.834 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   524.659 μs ±   6.788 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▁ ▅▅▂▃  ▁▁                   ▁                              ▁
  ███▇████▇████▇█████▇██▇▇▇▇▇█▇▆▇██▇▆▇▇▆▅▆▆▆▆▆▆▅▆▄▃▅▄▆▄▁▅▅▅▆▅▆▆ █
  522 μs        Histogram: log(frequency) by time        558 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

Rosetta

julia> @benchmark test(a,withsymbol)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  102.250 μs … 158.208 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     103.917 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   104.579 μs ±   2.233 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁  ▁
  106 μs           Histogram: frequency by time          104 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

julia> @benchmark test(a,withgreater)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.250 μs … 262.959 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     191.416 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   192.380 μs ±   3.231 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁ ▁
  192 μs           Histogram: frequency by time          191 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

Rosetta is faster on both. I’m not sure if the < vectorizes better (it might have fewer branches), but I don’t know why there is a 5x slowdown.

And the issue only happens on the findmin path that lowers to mapfoldl; if the call is changed a bit, it’s much faster:

julia> @benchmark findmin(y,dims=1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  105.500 μs … 136.375 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     105.625 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   105.993 μs ±   1.637 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
  107 μs           Histogram: frequency by time          106 μs <

 Memory estimate: 224 bytes, allocs estimate: 3.
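
For comparison, here is a minimal hand-written loop using a plain < (a sketch; unlike Base.findmin it does not implement the documented NaN-aware ordering), which is the kind of kernel that compiles to a tight, branch-light loop:

julia> function naive_findmin(x::Vector{Float64})
           isempty(x) && throw(ArgumentError("collection must be non-empty"))
           mv, mi = x[1], 1
           @inbounds for i in 2:length(x)
               if x[i] < mv  # plain compare, unlike Base.isgreater
                   mv, mi = x[i], i
               end
           end
           return (mv, mi)
       end
naive_findmin (generic function with 1 method)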

I got 3ms on Rosetta there. My native times roughly matched yours, but I can reproduce the argmin and findmin example.

Sorry for necro-posting, but I came upon this thread because I had noticed really poor performance of sum on the M1.

The good news is that LLVM 16 unrolls more: Compiler Explorer

Hopefully Julia 1.11 will make the switch to LLVM>=16 :slight_smile:
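
If you want to reproduce that locally rather than on Compiler Explorer, this is the kind of kernel to inspect (the exact function in the linked example is my assumption):

julia> function mysum(x)
           s = zero(eltype(x))
           @inbounds @simd for i in eachindex(x)  # @simd allows reassociation, so LLVM can vectorize/unroll
               s += x[i]
           end
           return s
       end
mysum (generic function with 1 method)

julia> @code_llvm debuginfo=:none mysum(rand(10_000))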

More on topic: the poorer performance of native versus Rosetta is no longer an issue since at least Julia 1.8. I suppose that went away with the switch to LLVM 13.

However, I am surprised to see that findmin from the OP is about 25% slower on 1.9 and 1.10 than on 1.8 :confused: