# Surprising performance characteristic: What is happening here?

#1

I ran into some performance discrepancies, which I do not completely understand. Maybe someone of you has an idea?

I want to compute the product of a vector of elements of length `2^k` (for simplicity). But instead
of computing the product directly I want to compute intermediate results and store them in
an array (representing a binary tree: The leafs are `x1`, `x2`, …, the parents of those are `x1*x2`, `x3*x4`, and those parents are `x1*x2*x3*x4`, etc).
Here is a implementation I ended up with (details are probably not that important)

``````function comp_level(n::Integer)
l = 0
n = n >> 1
while n > 0
l += 1
n = n >> 1
end
l
end
function evaltree!(tree, x::Vector{T}) where T
n = length(x)

levels = comp_level(n)

l = levels-1
k = 1 << l # = 2^l
i = 2*k - 1
j = n
while i ≥ k
tree[i] = x[j] * x[j - 1]
i -= 1
j -= 2
end
kk = k
for l = levels-2:-1:0
k = 1 << l # = 2^l
i = 2*k - 1
j = kk
while i ≥ k
tree[i] = tree[j] * tree[j + 1]
i -= 1
j += 2
end
kk = k
end
tree
end
``````

Now I did some benchmarking against the naive multiplication:

``````N = 256
tree = Vector{Float64}(N  - 1)
u = rand(N) * 3
@btime evaltree!(\$tree, \$u) # 339.703 ns
@btime prod(\$u) # 35.891 ns
``````

That’s somewhat in the range that I expected.
But the weird part is when I do the benchmark for `Complex128`

``````N = 256
tree3 = Vector{Complex128}(N  - 1)
u3 = rand(Complex128, N) * 3
@btime evaltree!(\$tree3, \$u3) # 498.309 ns
@btime prod(\$u3) # 719.548 ns
``````

The naive implementation becomes slower!
Does anybody has a clue what is happening here?

#2

Interesting! Looking forward to hearing from people know who know what they’re talking about, but goofing off in the mean time:

``````julia> function prod_loop(x)
out = one(eltype(x))
for xᵢ ∈ x
out *= xᵢ
end
out
end
prod_loop (generic function with 1 method)

julia> @benchmark prod(\$u)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     52.753 ns (0.00% GC)
median time:      52.774 ns (0.00% GC)
mean time:        54.252 ns (0.00% GC)
maximum time:     93.995 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     986

julia> @benchmark prod_loop(\$u)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     317.306 ns (0.00% GC)
median time:      317.357 ns (0.00% GC)
mean time:        327.991 ns (0.00% GC)
maximum time:     538.187 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     235

julia> @benchmark prod(\$u3)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     946.923 ns (0.00% GC)
median time:      947.308 ns (0.00% GC)
mean time:        968.120 ns (0.00% GC)
maximum time:     2.259 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     26

julia> @benchmark prod_loop(\$u3)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     947.720 ns (0.00% GC)
median time:      948.000 ns (0.00% GC)
mean time:        962.906 ns (0.00% GC)
maximum time:     2.375 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     25

julia> @benchmark reduce(*, \$u)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     52.748 ns (0.00% GC)
median time:      52.768 ns (0.00% GC)
mean time:        53.733 ns (0.00% GC)
maximum time:     97.945 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     986

julia> @benchmark reduce(*, \$u3)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     947.000 ns (0.00% GC)
median time:      947.385 ns (0.00% GC)
mean time:        960.708 ns (0.00% GC)
maximum time:     1.993 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     26

julia> @benchmark evaltree!(\$tree, \$u)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     359.797 ns (0.00% GC)
median time:      375.437 ns (0.00% GC)
mean time:        384.104 ns (0.00% GC)
maximum time:     1.972 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     207

julia> @benchmark evaltree!(\$tree3, \$u3)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     642.963 ns (0.00% GC)
median time:      657.335 ns (0.00% GC)
mean time:        671.973 ns (0.00% GC)
maximum time:     899.701 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     164

``````

It is interesting that `prod(::Vector{Float64})` and `reduce(*, ::Vector{Float64})` are both much faster than the looped version, but `prod(::Vector{Complex{Float64}})` and `reduce(*, ::Vector{Complex{Float64}})` are the same speed, and much slower than `evaltree!`.

Also, still not as fast as `evaltree!`:

``````julia> using StructsOfArrays

julia> u4 = convert(StructOfArrays, u3);

julia> @benchmark prod(\$u4)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     761.643 ns (0.00% GC)
median time:      761.800 ns (0.00% GC)
mean time:        771.736 ns (0.00% GC)
maximum time:     1.100 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     115
``````

#3

It seems that complex numbers hugely benefit from loop unrolling:

``````function unroll(xs::Vector{Complex{T}}) where T
out = one(Complex{T})
N = length(xs)
reminder = rem(N, 4)
i = 1
while i < (N - reminder)
@inbounds a = xs[i] * xs[i + 1]
@inbounds b = xs[i + 2] * xs[i + 3]
out *= (a * b)

i += 4
end

if reminder == 0
return out
end
@inbounds out *= xs[i]
if reminder == 1
return out
end
@inbounds out *= xs[i + 1]
if reminder == 2
return out
end
@inbounds out *= xs[N]

out
end
``````

I get the following result:

``````julia> @benchmark unroll(\$u)

BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     256.516 ns (0.00% GC)
median time:      275.364 ns (0.00% GC)
mean time:        301.272 ns (0.00% GC)
maximum time:     2.286 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     349
``````

I think the `Float64` case is so fast because of SIMD optimizations. These seem not to work for `Complex128`, but at least loop unrolling helps. And since `evaltree!` does some sort of loop unrolling this would explain why it is faster.
Definitely surprising, but at least I think I understand now what is happening…

#4

Try https://github.com/simonster/StructsOfArrays.jl for SIMD with Complex.

#5

``````using StaticArrays, Unrolled
julia> u3s = Size(256)(u3);

julia> @unroll function unrolled_prod(sv)
out = one(eltype(sv))
@inbounds @unroll for x ∈ sv
out *= x
end
out
end
unrolled_prod_unrolled_expansion_ (generic function with 1 method)

julia> @benchmark prod(\$u3s)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     636.583 ns (0.00% GC)
median time:      636.661 ns (0.00% GC)
mean time:        646.826 ns (0.00% GC)
maximum time:     850.935 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     168

julia> @benchmark unrolled_prod(\$u3s)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     638.575 ns (0.00% GC)
median time:      638.677 ns (0.00% GC)
mean time:        650.012 ns (0.00% GC)
maximum time:     2.917 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     167

julia> @benchmark unroll(\$u3)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     241.607 ns (0.00% GC)
median time:      242.077 ns (0.00% GC)
mean time:        247.740 ns (0.00% GC)
maximum time:     395.249 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     405

julia> using StructsOfArrays

julia> u4 = convert(StructOfArrays, u3);

julia> @benchmark unroll(\$u4)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     239.351 ns (0.00% GC)
median time:      252.722 ns (0.00% GC)
mean time:        249.447 ns (0.00% GC)
maximum time:     327.695 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     413

julia> @benchmark unroll(\$u3s)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     358.243 ns (0.00% GC)
median time:      358.348 ns (0.00% GC)
mean time:        367.258 ns (0.00% GC)
maximum time:     1.899 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     210
``````

I replaced Vector with AbstractVector in @saschatimme’s unroll function.
I guess it allows for better use of simd than StructOfArrays or the fully unrolled:

``````julia> @code_unrolled unrolled_prod(u3s)
quote  # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 105:
out = one(eltype(sv))
begin
\$(Expr(:inbounds, true))
begin  # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
let x = sv[1] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
out *= x
end
let x = sv[2] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
out *= x
end
let x = sv[3] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
out *= x
end
############## snipped
############## you get the idea
let x = sv[255] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
out *= x
end
let x = sv[256] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
out *= x
end
end
\$(Expr(:inbounds, :pop))
end
out
end
``````

The `@unroll` and StructOfArrays versions are all quicker to write than the unroll function.
That these easy solutions get much worse performance is disappointing, and suggests some improvements are possible. Lest we feel tempted to tinker around with more complicated expressions, chasing a >2x performance gain like we see here.