Very different performance on M1 Mac, native vs Rosetta

I was trying the argmin vs minimum benchmark that was being discussed and found that native ARM Julia is almost 5x slower than running under Rosetta. Is the ARM codegen really that much worse, or is this a bug specific to the M1? The native version is running on a fork of Julia with correct feature detection for the M1.
Benchmarks:

Native

julia> y = rand(100_000)
100000-element Vector{Float64}:
julia> @benchmark findmin($y)
BenchmarkTools.Trial: 9705 samples with 1 evaluation.
 Range (min … max):  505.000 μs … 627.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     511.125 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   513.248 μs ±   8.133 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▇▁▁▁▅▃██▅▅▅▆▃▃▂▁▁▂ ▁▁   ▁▁      ▁                            ▂
  ███████████████████████████████████▇▇▇▇▇▇▆▆▆▅▆▆▄▆▆▅▅▇▆▆▆▇▅▅▇▆ █
  505 μs        Histogram: log(frequency) by time        548 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark argmin($y)
BenchmarkTools.Trial: 9767 samples with 1 evaluation.
 Range (min … max):  504.916 μs … 595.792 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     506.458 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   510.306 μs ±   7.815 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█▁▂▃▆▁▁▅▂▁▁▃▂▁▃   ▁                                         ▁
  █████████████████▇▇▇█▇▇▇▇▇▆▇█▇▇▇▇▇▇▆▇▇▇▇▆▇▆▆▆▆▅▅▆▄▆▆▅▅▅▆▆▄▅▆▅ █
  505 μs        Histogram: log(frequency) by time        545 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.8.0-DEV.360
Commit c70db599c2* (2021-08-17 10:32 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.6.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, apple-a14)
Environment:
  JULIA_NUM_THREADS = 4
  JULIA_NUM_PRECOMPILE_TASKS = 4

Rosetta

julia> @benchmark findmin($y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  144.583 μs … 199.625 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     144.875 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   145.727 μs ±   2.973 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█          ▂▂         ▁                                      ▁
  ██▆▇▇▇▇▇▄▅▅▄██▆▅▃▆▅▆▆▅▅███▆▆▅▆▄▆▆▅▅▅▅▆▆▅▅▆▆▆▆▆▆▇▅▆▅▅▅▂▄▃▄▅▄▃▃ █
  145 μs        Histogram: log(frequency) by time        159 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark argmin($y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  144.375 μs … 434.167 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     145.270 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   148.420 μs ±   7.243 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▁▁▁▁▂▄▁ ▁▂▅▂▂▁▁▂▂▁▁▁▁▂▃▃▂▁                                  ▁
  ████████████████████████████▇▇▆▆▆▅▆▅▄▅▃▅▆▅▅▅▅▅▂▅▄▅▅▄▃▃▄▄▃▄▄▄▄ █
  144 μs        Histogram: log(frequency) by time        170 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Can you re-check with the Julia “Nightly” builds?
(A newer LLVM is better at supporting new CPUs.)

Or just add your Julia version info.

Added, thanks

FWIW, the M1 is treated the same as an Apple A7.

Not relevant here, but this can cause problems in some code due to not unrolling enough (this code isn’t unrolled anyway).

Yeah, that’s why I ran on my fork with hard-coded feature detection. It’s running as an A14, since the M1 target is only available in LLVM 13.

Does it still use the Cyclone scheduling model, or has that been updated too?
I thought I read that Apple hadn’t contributed a new model since then.

I dunno. I was searching for the features and came across this: https://github.com/llvm/llvm-project/blob/82507f1798768280cf5d5aab95caaafbc7fe6f47/llvm/include/llvm/Support/AArch64TargetParser.def , which is what I used for the features. But I don’t know if the scheduling model was also updated.

edit:
I checked LLVM, and the A14 still uses the Cyclone scheduling model :frowning:

Yeah, and you can check simple examples like sum and see that it only unrolls 2x. For every x86 CPU I’ve seen, it unrolls 4x.
The M1 benefits even more from unrolling in this example than any x86 CPU.
(Unrolling breaks up the dependency chain of the sum; more unrolling → more out-of-order opportunity, and the M1 is currently the best CPU in this respect AFAIK [definitely better than any x86].)
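
To make that concrete, here is a hand-rolled sketch of what unrolling buys (an illustration only, not what Julia’s sum actually emits):

julia> function sum4(x::Vector{Float64})
           # Four independent accumulators break the serial dependency
           # chain of a naive sum, so the adds can execute out of order.
           # (This reassociates the floating-point additions, which is
           # why the compiler only does it under @simd / fast-math.)
           s1 = s2 = s3 = s4 = 0.0
           i = 1
           @inbounds while i + 3 <= length(x)
               s1 += x[i]
               s2 += x[i + 1]
               s3 += x[i + 2]
               s4 += x[i + 3]
               i += 4
           end
           s = s1 + s2 + s3 + s4
           @inbounds while i <= length(x)  # remainder loop
               s += x[i]
               i += 1
           end
           return s
       end
sum4 (generic function with 1 method)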

Why hasn’t Apple added a new model? Wouldn’t Swift benefit from it too? Or does Swift use Apple’s own LLVM fork?

I keep asking the same thing. AMD and Intel do contribute models, but they’re incomplete and well behind the models that some folks, like the authors of uiCA, painstakingly reverse engineered.
I think they’d benefit.

But really, it’d help Julia most of all.
Most software is compiled for generic targets (although I think it’s common to set mtune= to something more recent).
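
For reference, here is how to check which CPU LLVM detected and how to pick an explicit target at startup (these are the standard Julia mechanisms, not anything specific to this thread; the apple-a14 value is just an example):

julia> Sys.CPU_NAME  # the CPU name LLVM detected for this session

# at startup, select/limit the target explicitly:
$ julia --cpu-target=apple-a14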

Does LLVM use the “official” schedulers, or do they optimize them further using measured data?
And it’s not like Apple doesn’t have an LLVM scheduling model written; otherwise they’d be leaving a lot of performance on the table. Maybe they keep it private to hide opcodes or something like that.

After much messing around I think I might have found an MWE. I don’t know where the difference is, but it’s related to the isgreater function.

Setup

julia> a = pairs(rand(100_000))

julia> function test(itr, op)
           y = iterate(itr)
           y === nothing && return nothing  # was `return init`; `init` is undefined here
           v = y[1]
           while true
               y = iterate(itr, y[2])
               y === nothing && break
               v = op(v, y[1])
           end
           return v
       end
test (generic function with 1 method)

julia> function withsymbol((fm, im), (ix, fx))
           fm > fx ? (fx, ix) : (fm, im)
       end
withsymbol (generic function with 1 method)

julia> function withgreater((fm, im), (ix, fx))
           Base.isgreater(fm, fx) ? (fx, ix) : (fm, im)
       end
withgreater (generic function with 1 method)
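
For context on why these two can compile so differently: a bare fm > fx is a single floating-point compare, while Base.isgreater (an internal function on this Julia version) is defined via isless with a total order that sorts NaN and missing below all regular values, so it carries extra branches. For example:

julia> 5.0 > NaN                 # ordinary float compare: false for NaN
false

julia> Base.isgreater(5.0, NaN)  # total order: NaN treated as smallest
true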

Native

julia> @benchmark test(a,withsymbol)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  136.916 μs … 821.875 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     137.083 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   137.795 μs ±   7.250 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▁             ▄                                         ▁   ▁
  ███▆▅▅▇▆▄▃▅▅▄▁▄▇██▆▅▅▅▅█▇▆▅▅▇▇▅▅▆▅▅▄▅▆▅▅▅▅▅▅▅▃▅▅▄▄▄▄▁▄▄▅▅▆█▆▆ █
  137 μs        Histogram: log(frequency) by time        147 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

julia> @benchmark test(a,withgreater)
BenchmarkTools.Trial: 9486 samples with 1 evaluation.
 Range (min … max):  521.542 μs … 656.209 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     521.834 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   524.659 μs ±   6.788 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▁ ▅▅▂▃  ▁▁                   ▁                              ▁
  ███▇████▇████▇█████▇██▇▇▇▇▇█▇▆▇██▇▆▇▇▆▅▆▆▆▆▆▆▅▆▄▃▅▄▆▄▁▅▅▅▆▅▆▆ █
  522 μs        Histogram: log(frequency) by time        558 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

Rosetta

julia> @benchmark test(a,withsymbol)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  102.250 μs … 158.208 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     103.917 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   104.579 μs ±   2.233 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁  ▁
  106 μs           Histogram: frequency by time          104 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

julia> @benchmark test(a,withgreater)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.250 μs … 262.959 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     191.416 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   192.380 μs ±   3.231 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁ ▁
  192 μs           Histogram: frequency by time          191 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.

Rosetta is faster on both. I’m not sure if the < vectorizes better (it might have fewer branches), but I don’t know why there is a 5x slowdown.

And the issue only happens on the findmin path that lowers to mapfoldl; if the call is changed a bit, it’s much faster:

julia> @benchmark findmin(y,dims=1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  105.500 μs … 136.375 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     105.625 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   105.993 μs ±   1.637 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
  107 μs           Histogram: frequency by time          106 μs <

 Memory estimate: 224 bytes, allocs estimate: 3.
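
For comparison, here is a minimal hand-written loop using a plain < (a sketch; unlike Base.findmin it does not implement the documented NaN-aware ordering), which is the kind of kernel that compiles to a tight, branch-light loop:

julia> function naive_findmin(x::Vector{Float64})
           isempty(x) && throw(ArgumentError("collection must be non-empty"))
           mv, mi = x[1], 1
           @inbounds for i in 2:length(x)
               if x[i] < mv  # plain compare, unlike Base.isgreater
                   mv, mi = x[i], i
               end
           end
           return (mv, mi)
       end
naive_findmin (generic function with 1 method)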

I got 3ms on Rosetta there. My native times roughly matched yours, but I can reproduce the argmin and findmin example.

Sorry for necro-posting, but I came upon this thread because I had noticed really poor performance of sum on the M1.

The good news is that LLVM 16 unrolls more: Compiler Explorer

Hopefully Julia 1.11 will make the switch to LLVM>=16 :slight_smile:
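
If you want to reproduce that locally rather than on Compiler Explorer, this is the kind of kernel to inspect (the exact function in the linked example is my assumption):

julia> function mysum(x)
           s = zero(eltype(x))
           @inbounds @simd for i in eachindex(x)  # @simd allows reassociation, so LLVM can vectorize/unroll
               s += x[i]
           end
           return s
       end
mysum (generic function with 1 method)

julia> @code_llvm debuginfo=:none mysum(rand(10_000))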

More on topic: the poorer performance of native versus Rosetta is no longer an issue since at least Julia 1.8. I suppose that went away with the switch to LLVM 13.

However, I am surprised to see that findmin from the OP is about 25% slower on 1.9 and 1.10 than on 1.8 :confused: