Here is what I get – much less of a difference:
julia> @benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 34.489 μs … 113.794 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 34.863 μs ┊ GC (median): 0.00%
Time (mean ± σ): 35.359 μs ± 2.933 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█▄ ▂▁ ▁
███▄▁▁▁▁▁▅██▇▇▆▅▄▁▃▁▁▄▃▁▃▄▁▄▃▁▁▁▁▃▄▁▅▆▆▅▅▆▄▅▅▅▄▁▁▅▃▃▅▄▅▅▆▅▆▅ █
34.5 μs Histogram: log(frequency) by time 48.9 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 33.193 μs … 77.087 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 33.623 μs ┊ GC (median): 0.00%
Time (mean ± σ): 34.231 μs ± 2.594 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▇▄ ▃ ▃ ▁
████▅▃▆▄▁▃▇██▇▆▄▃▃▇█▃▄▃▄▄▄▄▄▃▅▃▃▁▅▄▄▁▃▃▅▇▇▆▇▇▇▆▇▆▅▁▃▃▃▃▃▄▄▆ █
33.2 μs Histogram: log(frequency) by time 46.6 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Are you doing something special to get such tight histograms? I just left my laptop alone while it ran, but I still had a browser open and I’m sure my OS was running things in the background that I could have disabled. Maybe CPU pinning or something?
The benchmarking code for that plot is nothing special, but here it is if you want to run it yourself:
using LoopVectorization: @turbo
using BenchmarkTools: @benchmark
using Statistics: median
@inline function dot_product(a, b)
    dp = zero(eltype(a))
    for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_turbo(a, b)
    dp = zero(eltype(a))
    @turbo for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_inbounds(a, b)
    dp = zero(eltype(a))
    @inbounds for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_inbounds_simd(a, b)
    dp = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@noinline function dot_product_turbo_noinline(a, b)
    dp = zero(eltype(a))
    @turbo for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

function get_times_dot_product(func, N)
    times = Float64[]
    for n in N
        a = rand(n)
        b = rand(n)
        bench = @benchmark $func($a, $b)
        push!(times, median(bench.times))
    end
    return times
end
N = range(32, 256, step=1)
times1 = get_times_dot_product(dot_product, N)
times2 = get_times_dot_product(dot_product_turbo, N)
times3 = get_times_dot_product(dot_product_inbounds, N)
times4 = get_times_dot_product(dot_product_inbounds_simd, N)
times5 = get_times_dot_product(dot_product_turbo_noinline, N)
# plots
using CairoMakie: Figure, Axis, Legend, lines!
fig = Figure(size = (1000, 500))  # on Makie < 0.20 use resolution = (1000, 500) instead
ax = Axis(fig[1, 1], xlabel="N", ylabel="GFLOPS")
l1 = lines!(ax, N, N ./ times1)
l2 = lines!(ax, N, N ./ times2)
l3 = lines!(ax, N, N ./ times3)
l4 = lines!(ax, N, N ./ times4)
l5 = lines!(ax, N, N ./ times5)
fig[1, 2] = Legend(fig, [l1, l2, l3, l4, l5], ["Baseline", "@turbo", "@inbounds", "@inbounds @simd", "@turbo @noinline"])
display(fig)
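A note on the units of that y-axis, in case anyone wants to compare against their own numbers: BenchmarkTools stores `bench.times` in nanoseconds, so `N ./ times` is elements per nanosecond. If you count one multiply and one add per element, the floating-point rate is twice that. Here is a minimal sketch of the conversion; `gflops` and its `flops_per_element` keyword are hypothetical names that are not in the script above, and the timings below are placeholders, not measurements.

```julia
# Hypothetical helper, not part of the original script.
# n                 : vector length
# t_ns              : run time in nanoseconds (BenchmarkTools reports times in ns)
# flops_per_element : 2 assumes one multiply and one add per element (an assumption)
gflops(n, t_ns; flops_per_element = 2) = flops_per_element * n / t_ns

# Made-up example: a length-256 dot product that takes 32 ns.
# Flops per nanosecond is numerically the same as GFLOPS.
rate = gflops(256, 32.0)

# Broadcasting over a range of sizes, the same way the plotting code uses N ./ times.
sizes = 32:64
fake_times = fill(16.0, length(sizes))  # placeholder timings, not measurements
rates = gflops.(sizes, fake_times)
```

Passing `flops_per_element = 1` gives the same quantity as `N ./ times`, so the two conventions differ only by a constant factor and the shape of the curves is unchanged.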