LoopVectorization: @turbo performs worse than @inbounds on trivial loop

Yes, on an older i7-4790K (Haswell), the two perform much more comparably, with the @turbo version slightly ahead (median 17.4 μs vs. 19.7 μs):

julia> @benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  19.600 μs … 162.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.822 μs ±   2.164 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅  █   ▆  ▂                    ▁                             ▁
  █▁▁█▁▁▁█▁▁█▁▁▁▇▁▁▄▁▁▁▁▁▁▄▁▁▁▇▁▁█▁▁▁█▁▁▇▁▁▁▅▁▁▅▁▁▁▅▁▁▃▁▁▁▇▁▁▆ █
  19.6 μs       Histogram: log(frequency) by time      21.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.300 μs …  32.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.434 μs ± 370.649 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆    █     ▆                                         ▁       ▁
  █▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▆▁▁▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▇▁▁▁▁▁█▁▁▁▁▁▇ █
  17.3 μs       Histogram: log(frequency) by time      18.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.