LoopVectorization: @turbo performs worse than @inbounds on trivial loop

Here is what I get – much less of a difference:

julia> @benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  34.489 μs … 113.794 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.863 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.359 μs ±   2.933 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▄       ▂▁                                                 ▁
  ███▄▁▁▁▁▁▅██▇▇▆▅▄▁▃▁▁▄▃▁▃▄▁▄▃▁▁▁▁▃▄▁▅▆▆▅▅▆▄▅▅▅▄▁▁▅▃▃▅▄▅▅▆▅▆▅ █
  34.5 μs       Histogram: log(frequency) by time      48.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  33.193 μs … 77.087 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     33.623 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.231 μs ±  2.594 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▇▄       ▃       ▃                                        ▁
  ████▅▃▆▄▁▃▇██▇▆▄▃▃▇█▃▄▃▄▄▄▄▄▃▅▃▃▁▅▄▄▁▃▃▅▇▇▆▇▇▇▆▇▆▅▁▃▃▃▃▃▄▄▆ █
  33.2 μs      Histogram: log(frequency) by time      46.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Are you doing something special to get such tight histograms? I just left my laptop alone while it ran, but I still had a browser open and I’m sure my OS was running things in the background that I could have disabled. Maybe CPU pinning or something?

The benchmarking code for that plot is nothing special, but here it is if you want to run it yourself:

using LoopVectorization: @turbo
using BenchmarkTools: @benchmark
using Statistics: median

@inline function dot_product(a, b)
    dp = zero(eltype(a))
    for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_turbo(a, b)
    dp = zero(eltype(a))
    @turbo for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_inbounds(a, b)
    dp = zero(eltype(a))
    @inbounds for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_inbounds_simd(a, b)
    dp = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@noinline function dot_product_turbo_noinline(a, b)
    dp = zero(eltype(a))
    @turbo for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end


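# Median run time (in nanoseconds, as reported by BenchmarkTools) of `func`
# on fresh random vectors of each length in N.
function get_times_dot_product(func, N)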
    times = Float64[]
    for n in N
        a = rand(n)
        b = rand(n)

        bench = @benchmark $func($a, $b)
        push!(times, median(bench.times))
    end
    return times
end

N = range(32, 256, step=1)

times1 = get_times_dot_product(dot_product, N)
times2 = get_times_dot_product(dot_product_turbo, N)
times3 = get_times_dot_product(dot_product_inbounds, N)
times4 = get_times_dot_product(dot_product_inbounds_simd, N)
times5 = get_times_dot_product(dot_product_turbo_noinline, N)


# plots

using CairoMakie: Figure, Axis, Legend, lines!

# Note: Makie 0.20 and later take `size = (1000, 500)` in place of the deprecated `resolution`.
fig = Figure(resolution = (1000,500))
ax = Axis(fig[1, 1], xlabel="N", ylabel="GFLOPS")

l1 = lines!(ax, N, N ./ times1)
l2 = lines!(ax, N, N ./ times2)
l3 = lines!(ax, N, N ./ times3)
l4 = lines!(ax, N, N ./ times4)
l5 = lines!(ax, N, N ./ times5)

fig[1, 2] = Legend(fig, [l1, l2, l3, l4, l5], ["Baseline", "@turbo", "@inbounds", "@inbounds @simd", "@turbo @noinline"])

display(fig)
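
One note on the y-axis units, in case it helps when comparing numbers: `bench.times` is in nanoseconds, so `N ./ times` is elements processed per nanosecond. If you count a dot product as one multiply and one add per element (my assumption, not something the script states), GFLOP/s comes out at twice that. A minimal sketch of the conversion, where `dot_gflops` and `elements_per_ns` are hypothetical helper names and the numbers are made up for illustration:

```julia
# Hypothetical helpers, not part of the benchmark script above.
# Assumption: 2 FLOPs (one multiply, one add) per element, time in nanoseconds.
# FLOPs per nanosecond is numerically equal to GFLOP/s.
dot_gflops(n, time_ns) = 2 * n / time_ns

# Elements processed per nanosecond: the quantity that `N ./ times` plots.
elements_per_ns(n, time_ns) = n / time_ns

# Made-up example values, for illustration only (not measured).
n = 256
time_ns = 32.0
println("elements/ns = ", elements_per_ns(n, time_ns))
println("GFLOP/s     = ", dot_gflops(n, time_ns))
```

The relative ordering of the curves is the same either way, since the factor of two applies to every variant.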