Yes, I was going to ask whether the `vrr!` call you were benchmarking was really representative of the typical call. For one thing, I noticed that in your actual code it gets called with `SVector` instead of `Vector` arguments.
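That distinction matters because an `SVector` carries its length in its type (it wraps an `NTuple`), so the compiler already knows the trip count and can fully unroll the loop, leaving no runtime iteration count for `@turbo` to exploit. A minimal sketch of this, using a plain `NTuple` as a stand-in for `SVector` to avoid the StaticArrays dependency:

```julia
# When the length N is part of the type, the loop below has a
# compile-time-constant trip count and can be fully unrolled;
# there is no runtime length for @turbo's machinery to branch on.
function sumtuple(x::NTuple{N,Float64}) where {N}
    s = 0.0
    for i in 1:N  # N is a compile-time constant
        s += x[i]
    end
    s
end

sumtuple((1.0, 2.0, 3.0, 4.0))  # 10.0
```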
`@turbo` speeds things up by evaluating multiple loop iterations in parallel using SIMD instructions. For this to help, we need multiple loop iterations. Of course, `@simd` works the same way, but the two macros make different trade-offs in terms of performance as a function of the number of loop iterations.
Here is a very simple example, just using a sum:
```julia
julia> using LoopVectorization, BenchmarkTools

julia> function sumsimd(x)
           s = zero(eltype(x))
           @inbounds @simd for i ∈ eachindex(x)
               s += x[i]
           end
           s
       end
sumsimd (generic function with 1 method)

julia> function sumturbo(x)
           s = zero(eltype(x))
           @turbo for i ∈ eachindex(x)
               s += x[i]
           end
           s
       end
sumturbo (generic function with 1 method)

julia> x = rand(1);

julia> @btime sumsimd($x)
  1.764 ns (0 allocations: 0 bytes)
0.028409636745355238

julia> @btime sumturbo($x)
  3.996 ns (0 allocations: 0 bytes)
0.028409636745355238
```
`sumsimd` was over twice as fast for a single iteration.
```julia
using Cairo, Fontconfig, Gadfly, LoopVectorization, BenchmarkTools, DataFrames

itimes = Matrix{Float64}(undef, 64, 2);

@inline function isumsimd(x)
    s = zero(eltype(x))
    @inbounds @simd for i ∈ eachindex(x)
        s += x[i]
    end
    s
end
@inline function isumturbo(x)
    s = zero(eltype(x))
    @turbo for i ∈ eachindex(x)
        s += x[i]
    end
    s
end

for n ∈ axes(itimes, 1)
    x = rand(n)
    itimes[n, 1] = @belapsed isumsimd($x)
    itimes[n, 2] = @belapsed isumturbo($x)
end

elements_per_ns = 1e-9 .* axes(itimes, 1) ./ itimes;
df = DataFrame(elements_per_ns);
rename!(df, [:simd, :turbo]);
df.Size = axes(elements_per_ns, 1);
dfs = stack(df, Not([:Size]));
rename!(dfs, [:Size, :Macro, :elements_per_ns]);
plt = plot(dfs, x = :Size, y = :elements_per_ns, Geom.line, color = :Macro);
plt |> PNG("sumbenchmark.png")
```
I think overall LoopVectorization makes a much better trade-off, but `@simd` is faster for small numbers of iterations, as well as for small multiples of a power of 2 (32 on this computer; 16 on computers with AVX but not AVX512). For the `sum` example, LoopVectorization should eventually start winning there as well, but it'll take more iterations:
```julia
julia> x = rand(512);

julia> @btime isumsimd($x)
  18.055 ns (0 allocations: 0 bytes)
267.1071894162856

julia> @btime isumturbo($x)
  16.915 ns (0 allocations: 0 bytes)
267.1071894162856
```
Versus the first 8:

```julia
julia> first(df, 8) # elements per nanosecond
8×3 DataFrame
 Row │ simd      turbo     Size
     │ Float64   Float64   Int64
─────┼───────────────────────────
   1 │ 0.713267  0.338753      1
   2 │ 1.08284   0.67659       2
   3 │ 1.21655   0.974976      3
   4 │ 1.51515   1.35318       4
   5 │ 1.8162    1.69377       5
   6 │ 2.17628   2.03321       6
   7 │ 2.17729   2.30491       7
   8 │ 2.17984   2.70819       8
```
`@simd` is over twice as fast for 1-element vectors. It takes LoopVectorization a while to catch up.
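If a kernel really is called with both tiny and large inputs, one way to act on this trade-off is to branch on length at a measured crossover point. A hedged sketch (the crossover of 8 is just read off the table above, and `bigsum` here is plain `sum` standing in for the `@turbo` loop, so this runs without LoopVectorization):

```julia
# Small-size branch: the @simd loop, which won below ~8 elements above.
function smallsum(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    s
end

# Large-size branch: stand-in for the @turbo version of the loop.
bigsum(x) = sum(x)

# Dispatch on length; `crossover = 8` is a machine-specific guess,
# not a universal constant -- measure it on your own hardware.
adaptivesum(x; crossover = 8) =
    length(x) < crossover ? smallsum(x) : bigsum(x)
```

Whether the extra branch pays for itself depends on how predictable the sizes are in the real `vrr!` workload, which is another reason to benchmark with representative arguments.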