Surprising performance characteristic: What is happening here?

I ran into some performance discrepancies, which I do not completely understand. Maybe someone of you has an idea?

I want to compute the product of a vector of elements of length 2^k (for simplicity). But instead
of computing the product directly I want to compute intermediate results and store them in
an array (representing a binary tree: The leafs are x1, x2, …, the parents of those are x1*x2, x3*x4, and those parents are x1*x2*x3*x4, etc).
Here is a implementation I ended up with (details are probably not that important)

function comp_level(n::Integer)
    l = 0
    n = n >> 1
    while n > 0
        l += 1
        n = n >> 1
    end
    l
end
function evaltree!(tree, x::Vector{T}) where T
    n = length(x)

    levels = comp_level(n)

    l = levels-1
    k = 1 << l # = 2^l
    i = 2*k - 1
    j = n
    while i ≥ k
        tree[i] = x[j] * x[j - 1]
        i -= 1
        j -= 2
    end
    kk = k
    for l = levels-2:-1:0
        k = 1 << l # = 2^l
        i = 2*k - 1
        j = kk
        while i ≥ k
            tree[i] = tree[j] * tree[j + 1]
            i -= 1
            j += 2
        end
        kk = k
    end
    tree
end

Now I did some benchmarking against the naive multiplication:

N = 256
tree = Vector{Float64}(N  - 1)
u = rand(N) * 3
@btime evaltree!($tree, $u) # 339.703 ns
@btime prod($u) # 35.891 ns

That’s somewhat in the range that I expected.
But the weird part is when I do the benchmark for Complex128

N = 256
tree3 = Vector{Complex128}(N  - 1)
u3 = rand(Complex128, N) * 3
@btime evaltree!($tree3, $u3) # 498.309 ns
@btime prod($u3) # 719.548 ns

The naive implementation becomes slower!
Does anybody has a clue what is happening here?

1 Like

Interesting! Looking forward to hearing from people know who know what they’re talking about, but goofing off in the mean time:

julia> function prod_loop(x)
         out = one(eltype(x))
         for xᵢ ∈ x
           out *= xᵢ
         end
         out
       end
prod_loop (generic function with 1 method)

julia> @benchmark prod($u)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     52.753 ns (0.00% GC)
  median time:      52.774 ns (0.00% GC)
  mean time:        54.252 ns (0.00% GC)
  maximum time:     93.995 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     986

julia> @benchmark prod_loop($u)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     317.306 ns (0.00% GC)
  median time:      317.357 ns (0.00% GC)
  mean time:        327.991 ns (0.00% GC)
  maximum time:     538.187 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     235

julia> @benchmark prod($u3)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     946.923 ns (0.00% GC)
  median time:      947.308 ns (0.00% GC)
  mean time:        968.120 ns (0.00% GC)
  maximum time:     2.259 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     26

julia> @benchmark prod_loop($u3)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     947.720 ns (0.00% GC)
  median time:      948.000 ns (0.00% GC)
  mean time:        962.906 ns (0.00% GC)
  maximum time:     2.375 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     25


julia> @benchmark reduce(*, $u)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     52.748 ns (0.00% GC)
  median time:      52.768 ns (0.00% GC)
  mean time:        53.733 ns (0.00% GC)
  maximum time:     97.945 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     986

julia> @benchmark reduce(*, $u3)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     947.000 ns (0.00% GC)
  median time:      947.385 ns (0.00% GC)
  mean time:        960.708 ns (0.00% GC)
  maximum time:     1.993 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     26

julia> @benchmark evaltree!($tree, $u)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     359.797 ns (0.00% GC)
  median time:      375.437 ns (0.00% GC)
  mean time:        384.104 ns (0.00% GC)
  maximum time:     1.972 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     207

julia> @benchmark evaltree!($tree3, $u3)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     642.963 ns (0.00% GC)
  median time:      657.335 ns (0.00% GC)
  mean time:        671.973 ns (0.00% GC)
  maximum time:     899.701 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     164

It is interesting that prod(::Vector{Float64}) and reduce(*, ::Vector{Float64}) are both much faster than the looped version, but prod(::Vector{Complex{Float64}}) and reduce(*, ::Vector{Complex{Float64}}) are the same speed, and much slower than evaltree!.

Also, still not as fast as evaltree!:

julia> using StructsOfArrays

julia> u4 = convert(StructOfArrays, u3);

julia> @benchmark prod($u4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     761.643 ns (0.00% GC)
  median time:      761.800 ns (0.00% GC)
  mean time:        771.736 ns (0.00% GC)
  maximum time:     1.100 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     115
1 Like

It seems that complex numbers hugely benefit from loop unrolling:

function unroll(xs::Vector{Complex{T}}) where T
    out = one(Complex{T})
    N = length(xs)
    reminder = rem(N, 4)
    i = 1
    while i < (N - reminder)
        @inbounds a = xs[i] * xs[i + 1]
        @inbounds b = xs[i + 2] * xs[i + 3]
        out *= (a * b)

        i += 4
    end

    if reminder == 0
        return out
    end
    @inbounds out *= xs[i]
    if reminder == 1
        return out
    end
    @inbounds out *= xs[i + 1]
    if reminder == 2
        return out
    end
    @inbounds out *= xs[N]

    out
end

I get the following result:

julia> @benchmark unroll($u)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     256.516 ns (0.00% GC)
  median time:      275.364 ns (0.00% GC)
  mean time:        301.272 ns (0.00% GC)
  maximum time:     2.286 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     349

I think the Float64 case is so fast because of SIMD optimizations. These seem not to work for Complex128, but at least loop unrolling helps. And since evaltree! does some sort of loop unrolling this would explain why it is faster.
Definitely surprising, but at least I think I understand now what is happening…

Try https://github.com/simonster/StructsOfArrays.jl for SIMD with Complex.

To add to what I had yesterday

using StaticArrays, Unrolled
julia> u3s = Size(256)(u3);

julia> @unroll function unrolled_prod(sv)
           out = one(eltype(sv))
           @inbounds @unroll for x ∈ sv
               out *= x
           end
           out
       end
unrolled_prod_unrolled_expansion_ (generic function with 1 method)

julia> @benchmark prod($u3s)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     636.583 ns (0.00% GC)
  median time:      636.661 ns (0.00% GC)
  mean time:        646.826 ns (0.00% GC)
  maximum time:     850.935 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     168

julia> @benchmark unrolled_prod($u3s)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     638.575 ns (0.00% GC)
  median time:      638.677 ns (0.00% GC)
  mean time:        650.012 ns (0.00% GC)
  maximum time:     2.917 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     167

julia> @benchmark unroll($u3)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     241.607 ns (0.00% GC)
  median time:      242.077 ns (0.00% GC)
  mean time:        247.740 ns (0.00% GC)
  maximum time:     395.249 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     405

julia> using StructsOfArrays

julia> u4 = convert(StructOfArrays, u3);

julia> @benchmark unroll($u4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     239.351 ns (0.00% GC)
  median time:      252.722 ns (0.00% GC)
  mean time:        249.447 ns (0.00% GC)
  maximum time:     327.695 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     413

julia> @benchmark unroll($u3s)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     358.243 ns (0.00% GC)
  median time:      358.348 ns (0.00% GC)
  mean time:        367.258 ns (0.00% GC)
  maximum time:     1.899 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     210

I replaced Vector with AbstractVector in @saschatimme’s unroll function.
I guess it allows for better use of simd than StructOfArrays or the fully unrolled:

julia> @code_unrolled unrolled_prod(u3s)
quote  # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 105:
    out = one(eltype(sv))
    begin 
        $(Expr(:inbounds, true))
        begin  # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
            let x = sv[1] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
                out *= x
            end
            let x = sv[2] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
                out *= x
            end
            let x = sv[3] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
                out *= x
            end
############## snipped
############## you get the idea
            let x = sv[255] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
                out *= x
            end
            let x = sv[256] # /home/chris/.julia/v0.6/Unrolled/src/Unrolled.jl, line 38:
                out *= x
            end
        end
        $(Expr(:inbounds, :pop))
    end
    out
end

The @unroll and StructOfArrays versions are all quicker to write than the unroll function.
That these easy solutions get much worse performance is disappointing, and suggests some improvements are possible. Lest we feel tempted to tinker around with more complicated expressions, chasing a >2x performance gain like we see here.

1 Like