I ran into some performance discrepancies, which I do not completely understand. Maybe someone of you has an idea?
I want to compute the product of a vector of elements of length 2^k (for simplicity). But instead
of computing the product directly I want to compute intermediate results and store them in
an array (representing a binary tree: The leafs are x1, x2, …, the parents of those are x1*x2, x3*x4, and those parents are x1*x2*x3*x4, etc).
Here is a implementation I ended up with (details are probably not that important)
function comp_level(n::Integer)
l = 0
n = n >> 1
while n > 0
l += 1
n = n >> 1
end
l
end
function evaltree!(tree, x::Vector{T}) where T
n = length(x)
levels = comp_level(n)
l = levels-1
k = 1 << l # = 2^l
i = 2*k - 1
j = n
while i ≥ k
tree[i] = x[j] * x[j - 1]
i -= 1
j -= 2
end
kk = k
for l = levels-2:-1:0
k = 1 << l # = 2^l
i = 2*k - 1
j = kk
while i ≥ k
tree[i] = tree[j] * tree[j + 1]
i -= 1
j += 2
end
kk = k
end
tree
end
Now I did some benchmarking against the naive multiplication:
N = 256
tree = Vector{Float64}(N - 1)
u = rand(N) * 3
@btime evaltree!($tree, $u) # 339.703 ns
@btime prod($u) # 35.891 ns
That’s somewhat in the range that I expected.
But the weird part is when I do the benchmark for Complex128
It is interesting that prod(::Vector{Float64}) and reduce(*, ::Vector{Float64}) are both much faster than the looped version, but prod(::Vector{Complex{Float64}}) and reduce(*, ::Vector{Complex{Float64}}) are the same speed, and much slower than evaltree!.
It seems that complex numbers hugely benefit from loop unrolling:
function unroll(xs::Vector{Complex{T}}) where T
out = one(Complex{T})
N = length(xs)
reminder = rem(N, 4)
i = 1
while i < (N - reminder)
@inbounds a = xs[i] * xs[i + 1]
@inbounds b = xs[i + 2] * xs[i + 3]
out *= (a * b)
i += 4
end
if reminder == 0
return out
end
@inbounds out *= xs[i]
if reminder == 1
return out
end
@inbounds out *= xs[i + 1]
if reminder == 2
return out
end
@inbounds out *= xs[N]
out
end
I think the Float64 case is so fast because of SIMD optimizations. These seem not to work for Complex128, but at least loop unrolling helps. And since evaltree! does some sort of loop unrolling this would explain why it is faster.
Definitely surprising, but at least I think I understand now what is happening…
using StaticArrays, Unrolled
julia> u3s = Size(256)(u3);
julia> @unroll function unrolled_prod(sv)
out = one(eltype(sv))
@inbounds @unroll for x ∈ sv
out *= x
end
out
end
unrolled_prod_unrolled_expansion_ (generic function with 1 method)
julia> @benchmark prod($u3s)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 636.583 ns (0.00% GC)
median time: 636.661 ns (0.00% GC)
mean time: 646.826 ns (0.00% GC)
maximum time: 850.935 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 168
julia> @benchmark unrolled_prod($u3s)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 638.575 ns (0.00% GC)
median time: 638.677 ns (0.00% GC)
mean time: 650.012 ns (0.00% GC)
maximum time: 2.917 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 167
julia> @benchmark unroll($u3)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 241.607 ns (0.00% GC)
median time: 242.077 ns (0.00% GC)
mean time: 247.740 ns (0.00% GC)
maximum time: 395.249 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 405
julia> using StructsOfArrays
julia> u4 = convert(StructOfArrays, u3);
julia> @benchmark unroll($u4)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 239.351 ns (0.00% GC)
median time: 252.722 ns (0.00% GC)
mean time: 249.447 ns (0.00% GC)
maximum time: 327.695 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 413
julia> @benchmark unroll($u3s)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 358.243 ns (0.00% GC)
median time: 358.348 ns (0.00% GC)
mean time: 367.258 ns (0.00% GC)
maximum time: 1.899 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 210
I replaced Vector with AbstractVector in @saschatimme’s unroll function.
I guess it allows for better use of simd than StructOfArrays or the fully unrolled:
julia> @code_unrolled unrolled_prod(u3s)
quote # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 105:
out = one(eltype(sv))
begin
$(Expr(:inbounds, true))
begin # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
let x = sv[1] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
out *= x
end
let x = sv[2] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
out *= x
end
let x = sv[3] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
out *= x
end
############## snipped
############## you get the idea
let x = sv[255] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
out *= x
end
let x = sv[256] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
out *= x
end
end
$(Expr(:inbounds, :pop))
end
out
end
The @unroll and StructOfArrays versions are all quicker to write than the unroll function.
That these easy solutions get much worse performance is disappointing, and suggests some improvements are possible. Lest we feel tempted to tinker around with more complicated expressions, chasing a >2x performance gain like we see here.