Branching in loop decrease perf by ~x5

wizebt · December 7, 2020, 9:05am

Hi,

In Julia 1.3.x version extrema function use to be much slower that minimum + maximum execution. At that time I wrote my own specialized extrema function and gained ~x3 by avoiding one test out of the two in each iteration (algo change).
I recently noticed that this extrema performance issue was corrected in the current 1.5.3. and now extrema is performing as fast as mimimum + maximum.
However when replicating the algo change trying to avoid one test on each iteration I got a counter perf by a factor of ~5
See test code and bench results below.
Anyone has an idea on how to keep the algo improvement and gain in perf? It seems that the branching is fooling LLVM optimization.
Thanks

using BenchmarkTools
n = 1000;
a = rand(1:n, n);

function extrema1(a::Vector{Int64})
    vmin = vmax = @inbounds a[1]
    for i = 2:length(a)
        v = @inbounds a[i]
        if v < vmin
            vmin = v
        end
        if v > vmax
            vmax = v
        end
    end
    return (vmin, vmax)
end

function extrema2(a::Vector{Int64})
    vmin = vmax = @inbounds a[1]
    for i = 2:length(a)
        v = @inbounds a[i]
        if v < vmin
            vmin = v
        elseif v > vmax
            vmax = v
        end
    end
    return (vmin, vmax)
end

function extrema3(a::Vector{Int64})
    vmin = vmax = @inbounds a[1]
    for i = 2:length(a)
        v = @inbounds a[i]
        if v < vmin
            vmin = v
            continue
        end
        if v > vmax
            vmax = v
        end
    end
    return (vmin, vmax)
end

julia> @benchmark extrema($a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     272.136 ns (0.00% GC)
  median time:      272.755 ns (0.00% GC)
  mean time:        284.915 ns (0.00% GC)
  maximum time:     589.474 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     323

julia> @benchmark extrema1($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     268.232 ns (0.00% GC)
  median time:      268.824 ns (0.00% GC)
  mean time:        279.133 ns (0.00% GC)
  maximum time:     705.588 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     340

julia> @benchmark extrema2($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.250 μs (0.00% GC)
  median time:      1.250 μs (0.00% GC)
  mean time:        1.304 μs (0.00% GC)
  maximum time:     3.680 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark extrema3($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.250 μs (0.00% GC)
  median time:      1.290 μs (0.00% GC)
  mean time:        1.358 μs (0.00% GC)
  maximum time:     5.610 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> VERSION
v"1.5.3"

Elrod · December 7, 2020, 9:24am

The problem is that the code uses SIMD instructions. Using:

@code_llvm debuginfo=:none extrema1(a)

This is the vectorized loop body:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %24, %vector.body ]
  %vec.phi20 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %25, %vector.body ]
  %vec.phi21 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %26, %vector.body ]
  %vec.phi22 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %27, %vector.body ]
  %vec.phi23 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %32, %vector.body ]
  %vec.phi24 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %33, %vector.body ]
  %vec.phi25 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %34, %vector.body ]
  %vec.phi26 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %35, %vector.body ]
  %offset.idx = or i64 %index, 2
  %11 = add nsw i64 %offset.idx, -1
  %12 = getelementptr inbounds i64, i64* %3, i64 %11
  %13 = bitcast i64* %12 to <8 x i64>*
  %wide.load = load <8 x i64>, <8 x i64>* %13, align 8
  %14 = getelementptr inbounds i64, i64* %12, i64 8
  %15 = bitcast i64* %14 to <8 x i64>*
  %wide.load27 = load <8 x i64>, <8 x i64>* %15, align 8
  %16 = getelementptr inbounds i64, i64* %12, i64 16
  %17 = bitcast i64* %16 to <8 x i64>*
  %wide.load28 = load <8 x i64>, <8 x i64>* %17, align 8
  %18 = getelementptr inbounds i64, i64* %12, i64 24
  %19 = bitcast i64* %18 to <8 x i64>*
  %wide.load29 = load <8 x i64>, <8 x i64>* %19, align 8
  %20 = icmp slt <8 x i64> %wide.load, %vec.phi
  %21 = icmp slt <8 x i64> %wide.load27, %vec.phi20
  %22 = icmp slt <8 x i64> %wide.load28, %vec.phi21
  %23 = icmp slt <8 x i64> %wide.load29, %vec.phi22
  %24 = select <8 x i1> %20, <8 x i64> %wide.load, <8 x i64> %vec.phi
  %25 = select <8 x i1> %21, <8 x i64> %wide.load27, <8 x i64> %vec.phi20
  %26 = select <8 x i1> %22, <8 x i64> %wide.load28, <8 x i64> %vec.phi21
  %27 = select <8 x i1> %23, <8 x i64> %wide.load29, <8 x i64> %vec.phi22
  %28 = icmp slt <8 x i64> %vec.phi23, %wide.load
  %29 = icmp slt <8 x i64> %vec.phi24, %wide.load27
  %30 = icmp slt <8 x i64> %vec.phi25, %wide.load28
  %31 = icmp slt <8 x i64> %vec.phi26, %wide.load29
  %32 = select <8 x i1> %28, <8 x i64> %wide.load, <8 x i64> %vec.phi23
  %33 = select <8 x i1> %29, <8 x i64> %wide.load27, <8 x i64> %vec.phi24
  %34 = select <8 x i1> %30, <8 x i64> %wide.load28, <8 x i64> %vec.phi25
  %35 = select <8 x i1> %31, <8 x i64> %wide.load29, <8 x i64> %vec.phi26
  %index.next = add i64 %index, 32
  %36 = icmp eq i64 %index.next, %n.vec
  br i1 %36, label %middle.block, label %vector.body

On my computer, it is doing 8 comparisons on every iteration: 4 slt (signed less than) of the load vs the min, and 4 checks of if the max is less than the load. Each of these 8 comparisons is actually comparing 8 numbers, meaning 64 comparisons happen on every iteration of the loop!
64 comparisons, not to.

  %20 = icmp slt <8 x i64> %wide.load, %vec.phi
  %24 = select <8 x i1> %20, <8 x i64> %wide.load, <8 x i64> %vec.phi

That is a comparison with <8 x i64> that it uses to then select from those values.

Unfortunately, when doing many-per-comparison, it’s hard to branch.
If you add a branch saying that, if the first comparison triggers, don’t do the second, how is that supposed to work? What is the computer supposed to be doing, if each instruction does many?

You could mask off the subsequent comparisons, that is, do them anyway but ignore the result. That is, do the first 8 comparisons with one instruction. Then, do the second 8 as well, but selectively ignore these 8 based on how the first results went. That would give you 0 performance benefit – you’re still doing the same number of comparisons – but the performance penalty would also be pretty negligible.

Or, you could stop doing 8 comparisons per instruction, and then actually branch like your code says. That works too, but naturally has a massive performance hit. That is what you’re seeing.

BTW, this is faster:

julia> function extrema5(a::AbstractVector{T}) where {T}
           vmin = typemax(T); vmax = typemin(T)
           for i = eachindex(a)
               v = @inbounds a[i]
               if v < vmin
                   vmin = v
               end
               if v > vmax
                   vmax = v
               end
           end
           return (vmin, vmax)
       end
extrema5 (generic function with 1 method)

julia> @btime extrema1($a)
  58.404 ns (0 allocations: 0 bytes)
(1, 1000)

julia> @btime extrema2($a)
  671.128 ns (0 allocations: 0 bytes)
(1, 1000)

julia> @btime extrema5($a)
  57.974 ns (0 allocations: 0 bytes)
(1, 1000)

In general, it’s better to start iterating arrays on their starting value for some low level reasons (Julia aligns the start of arrays; loads that are aligned are often faster).

Also, most recent computers are limited to/capable of 4x 64bit numbers/instruction, not 8x. Thus I see a >10x performance hit, instead of a ~5x.

feanor12 · December 7, 2020, 10:03am

You can also replace if statements by multiplication and addition to avoid branching. This could enable SIMD use, but I am not sure if this is possible in julia.

in C something like this can be done

if (b > a) b = a;

bool if_else = b > a;
b = a * if_else + b * !if_else;

Elrod · December 7, 2020, 10:05am

The compiler normally does a good job making transformations like this, especially with integers, so things like this normally aren’t necessary.
Plus, that transform is actually further from the assembly you actually want.

Use ifelse if you want to avoid branches.

feanor12 · December 7, 2020, 10:07am

I did not know about ifelse.

wizebt · December 12, 2020, 11:01am

Thanks Elrod for your very clear and documented explanation on perf diff. and your excellent extrema5 version

Still thinking on how to adapt the single comp using a SIMD implementation or a parallel algo on CUDA…

In the mean time, tried some variations to compare iterator’s performance.

To set a ref Base.extrema

using BenchmarkTools
n = 1000;
a = rand(1:n, n);

julia> @benchmark extrema($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     274.347 ns (0.00% GC)
  median time:      274.675 ns (0.00% GC)
  mean time:        288.293 ns (0.00% GC)
  maximum time:     651.623 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     308

your excellent extrema5 version, tests are slightly modified with ternay op without introducing and perf change.

function extrema5(a::AbstractVector{T}) where {T}
    vmin = typemax(T)
    vmax = typemin(T)
    for i in eachindex(a)
        v = @inbounds a[i]
        #if v < vmin vmin = v end
        #if v > vmax vmax = v end
        vmin = v < vmin ? v : vmin
        vmax = v > vmax ? v : vmax
    end
    return (vmin, vmax)
end

julia> @benchmark extrema5($a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     267.636 ns (0.00% GC)
  median time:      267.638 ns (0.00% GC)
  mean time:        281.006 ns (0.00% GC)
  maximum time:     752.478 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     343

And a faster one

function extrema6(a::Vector{T}) where T<:Int64
    vmin = vmax = @inbounds a[1]
    for (v, i) in enumerate(a)
        vmin = v < vmin ? v : vmin
        vmax = v > vmax ? v : vmax
    end
    return (vmin, vmax)
end

julia> @benchmark extrema6($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     256.351 ns (0.00% GC)
  median time:      256.633 ns (0.00% GC)
  mean time:        269.742 ns (0.00% GC)
  maximum time:     616.575 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     362

enumerate seems to outperform eachindex, iterate, and for v in a

Now curious on your last comment on alignement starting with 1 is better
Watch this

function extrema2(a::Vector{Int64})
    vmin = vmax = @inbounds a[1]
    for i = 2:length(a)
        v = @inbounds a[i]
        vmin = v < vmin ? v : vmin
        vmax = v > vmax ? v : vmax
    end
    return (vmin, vmax)
end

julia> @benchmark extrema2($a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     272.443 ns (0.00% GC)
  median time:      273.068 ns (0.00% GC)
  mean time:        287.159 ns (0.00% GC)
  maximum time:     635.294 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     323

Same but starting on 1 take a 3.2 perf hit

function extrema2(a::Vector{Int64})
    vmin = vmax = @inbounds a[1]
    for i = 1:length(a)
        v = @inbounds a[i]
        vmin = v < vmin ? v : vmin
        vmax = v > vmax ? v : vmax
    end
    return (vmin, vmax)
end

julia> @benchmark extrema2($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     885.436 ns (0.00% GC)
  median time:      890.909 ns (0.00% GC)
  mean time:        934.109 ns (0.00% GC)
  maximum time:     2.149 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     55

Oscar_Smith · December 13, 2020, 7:42am

That looks like a memory alignment thing that might not be reproducible.

Topic		Replies	Views
Replicate @tturbo performance Performance	23	2430	August 23, 2022
Perf improvement suggestion for simple extrema Performance proposal	4	651	February 18, 2020
Why isn't minmax branchless? Internals & Design	22	4581	July 20, 2021
Min/max swap using SIMD Performance simd	18	240	March 12, 2025
Branch elimination in Julia Performance question	7	1072	July 22, 2020

Branching in loop decrease perf by ~x5

Related topics