# Branching in loop decreases perf by ~5x

Hi,

In Julia 1.3.x, the `extrema` function used to be much slower than calling `minimum` + `maximum` separately. At that time I wrote my own specialized extrema function and gained ~3x by avoiding one of the two tests in each iteration (an algorithmic change).
I recently noticed that this extrema performance issue was fixed in the current 1.5.3, and `extrema` now performs as fast as `minimum` + `maximum`.
However, when replicating the algorithmic change to avoid one test per iteration, I got a performance regression of ~5x.
See test code and bench results below.
Does anyone have an idea of how to keep the algorithmic improvement and still gain in perf? It seems that the branching is defeating LLVM's optimizations.
Thanks

```julia
using BenchmarkTools
n = 1000;
a = rand(1:n, n);

function extrema1(a::Vector{Int64})
    vmin = vmax = @inbounds a[1]
    for i = 2:length(a)
        v = @inbounds a[i]
        if v < vmin
            vmin = v
        end
        if v > vmax
            vmax = v
        end
    end
    return (vmin, vmax)
end

function extrema2(a::Vector{Int64})
    vmin = vmax = @inbounds a[1]
    for i = 2:length(a)
        v = @inbounds a[i]
        if v < vmin
            vmin = v
        elseif v > vmax
            vmax = v
        end
    end
    return (vmin, vmax)
end

function extrema3(a::Vector{Int64})
    vmin = vmax = @inbounds a[1]
    for i = 2:length(a)
        v = @inbounds a[i]
        if v < vmin
            vmin = v
            continue
        end
        if v > vmax
            vmax = v
        end
    end
    return (vmin, vmax)
end
```
```julia
julia> @benchmark extrema($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     272.136 ns (0.00% GC)
  median time:      272.755 ns (0.00% GC)
  mean time:        284.915 ns (0.00% GC)
  maximum time:     589.474 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     323

julia> @benchmark extrema1($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     268.232 ns (0.00% GC)
  median time:      268.824 ns (0.00% GC)
  mean time:        279.133 ns (0.00% GC)
  maximum time:     705.588 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     340

julia> @benchmark extrema2($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.250 μs (0.00% GC)
  median time:      1.250 μs (0.00% GC)
  mean time:        1.304 μs (0.00% GC)
  maximum time:     3.680 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark extrema3($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.250 μs (0.00% GC)
  median time:      1.290 μs (0.00% GC)
  mean time:        1.358 μs (0.00% GC)
  maximum time:     5.610 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> VERSION
v"1.5.3"
```

The problem is that the fast code uses SIMD instructions. Using:

```julia
@code_llvm debuginfo=:none extrema1(a)
```

This is the vectorized loop body:

```llvm
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %24, %vector.body ]
  %vec.phi20 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %25, %vector.body ]
  %vec.phi21 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %26, %vector.body ]
  %vec.phi22 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %27, %vector.body ]
  %vec.phi23 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %32, %vector.body ]
  %vec.phi24 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %33, %vector.body ]
  %vec.phi25 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %34, %vector.body ]
  %vec.phi26 = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %35, %vector.body ]
  %offset.idx = or i64 %index, 2
  %11 = add nsw i64 %offset.idx, -1
  %12 = getelementptr inbounds i64, i64* %3, i64 %11
  %13 = bitcast i64* %12 to <8 x i64>*
  %wide.load = load <8 x i64>, <8 x i64>* %13, align 8
  %14 = getelementptr inbounds i64, i64* %12, i64 8
  %15 = bitcast i64* %14 to <8 x i64>*
  %wide.load27 = load <8 x i64>, <8 x i64>* %15, align 8
  %16 = getelementptr inbounds i64, i64* %12, i64 16
  %17 = bitcast i64* %16 to <8 x i64>*
  %wide.load28 = load <8 x i64>, <8 x i64>* %17, align 8
  %18 = getelementptr inbounds i64, i64* %12, i64 24
  %19 = bitcast i64* %18 to <8 x i64>*
  %wide.load29 = load <8 x i64>, <8 x i64>* %19, align 8
  %20 = icmp slt <8 x i64> %wide.load, %vec.phi
  %21 = icmp slt <8 x i64> %wide.load27, %vec.phi20
  %22 = icmp slt <8 x i64> %wide.load28, %vec.phi21
  %23 = icmp slt <8 x i64> %wide.load29, %vec.phi22
  %24 = select <8 x i1> %20, <8 x i64> %wide.load, <8 x i64> %vec.phi
  %25 = select <8 x i1> %21, <8 x i64> %wide.load27, <8 x i64> %vec.phi20
  %26 = select <8 x i1> %22, <8 x i64> %wide.load28, <8 x i64> %vec.phi21
  %27 = select <8 x i1> %23, <8 x i64> %wide.load29, <8 x i64> %vec.phi22
  %28 = icmp slt <8 x i64> %vec.phi23, %wide.load
  %29 = icmp slt <8 x i64> %vec.phi24, %wide.load27
  %30 = icmp slt <8 x i64> %vec.phi25, %wide.load28
  %31 = icmp slt <8 x i64> %vec.phi26, %wide.load29
  %32 = select <8 x i1> %28, <8 x i64> %wide.load, <8 x i64> %vec.phi23
  %33 = select <8 x i1> %29, <8 x i64> %wide.load27, <8 x i64> %vec.phi24
  %34 = select <8 x i1> %30, <8 x i64> %wide.load28, <8 x i64> %vec.phi25
  %35 = select <8 x i1> %31, <8 x i64> %wide.load29, <8 x i64> %vec.phi26
  %index.next = add i64 %index, 32
  %36 = icmp eq i64 %index.next, %n.vec
  br i1 %36, label %middle.block, label %vector.body
```

On my computer, it is doing 8 vector comparisons on every iteration: 4 `slt` (signed less than) of the load vs the min, and 4 checks of whether the max is less than the load. Each of these 8 comparisons actually compares 8 numbers at once, meaning 64 comparisons happen on every iteration of the loop!
Sixty-four comparisons, not two.

```llvm
  %20 = icmp slt <8 x i64> %wide.load, %vec.phi
  %24 = select <8 x i1> %20, <8 x i64> %wide.load, <8 x i64> %vec.phi
```

That is a comparison on `<8 x i64>` vectors whose result is then used to select between those values.

Unfortunately, when each instruction performs many comparisons at once, it's hard to branch.
If you add a branch saying that, if the first comparison triggers, don't do the second, how is that supposed to work? What is the processor supposed to do when each instruction handles many elements, and the first comparison triggers for some of them but not others?

You could mask off the subsequent comparisons, that is, do them anyway but ignore the result. That is, do the first 8 comparisons with one instruction. Then, do the second 8 as well, but selectively ignore these 8 based on how the first results went. That would give you 0 performance benefit – you’re still doing the same number of comparisons – but the performance penalty would also be pretty negligible.

Or, you could stop doing 8 comparisons per instruction, and then actually branch like your code says. That works too, but naturally has a massive performance hit. That is what you’re seeing.

BTW, this is faster:

```julia
julia> function extrema5(a::AbstractVector{T}) where {T}
           vmin = typemax(T); vmax = typemin(T)
           for i = eachindex(a)
               v = @inbounds a[i]
               if v < vmin
                   vmin = v
               end
               if v > vmax
                   vmax = v
               end
           end
           return (vmin, vmax)
       end
extrema5 (generic function with 1 method)

julia> @btime extrema1($a)
  58.404 ns (0 allocations: 0 bytes)
(1, 1000)

julia> @btime extrema2($a)
  671.128 ns (0 allocations: 0 bytes)
(1, 1000)

julia> @btime extrema5($a)
  57.974 ns (0 allocations: 0 bytes)
(1, 1000)
```

In general, it's better to start iterating an array at its first element, for low-level reasons (Julia aligns the start of arrays, and aligned loads are often faster).

Also, most recent computers can process 4 64-bit numbers per instruction, not 8. That's why I see a >10x performance hit instead of your ~5x.


You can also replace if statements with multiplication and addition to avoid branching. This could enable SIMD use, but I am not sure whether this is possible in Julia.

In C, something like this can be done:

```c
if (b > a) b = a;
```
```c
bool if_else = b > a;
b = a * if_else + b * !if_else;
```

The compiler normally does a good job making transformations like this on its own, especially with integers, so they usually aren't necessary.
Plus, that transform is actually further from the assembly you actually want.

Use `ifelse` if you want to avoid branches.
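As an illustrative sketch (a variant of the earlier loops, not the Base implementation), an `ifelse`-based extrema could look like this; since `ifelse` evaluates both of its value arguments up front, there is no branch for the vectorizer to trip over:

```julia
# Branch-free extrema using `ifelse`. Both candidate values are computed on
# every iteration, so LLVM can lower the loop to SIMD compare-and-select.
function extrema_ifelse(a::AbstractVector{T}) where {T}
    vmin = typemax(T); vmax = typemin(T)
    @inbounds for i in eachindex(a)
        v = a[i]
        vmin = ifelse(v < vmin, v, vmin)
        vmax = ifelse(v > vmax, v, vmax)
    end
    return (vmin, vmax)
end
```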


I did not know about `ifelse`. Thanks Elrod for your very clear and well-documented explanation of the perf difference, and for your excellent extrema5 version.

Still thinking about how to adapt the single-comparison algorithm to a SIMD implementation or a parallel algorithm on CUDA…

In the meantime, I tried some variations to compare iterator performance.

To set a reference, `Base.extrema`:

```julia
using BenchmarkTools
n = 1000;
a = rand(1:n, n);
```
```julia
julia> @benchmark extrema($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     274.347 ns (0.00% GC)
  median time:      274.675 ns (0.00% GC)
  mean time:        288.293 ns (0.00% GC)
  maximum time:     651.623 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     308
```

Your excellent extrema5 version; the tests are slightly modified to use the ternary operator, which introduces no perf change.

```julia
function extrema5(a::AbstractVector{T}) where {T}
    vmin = typemax(T)
    vmax = typemin(T)
    for i in eachindex(a)
        v = @inbounds a[i]
        #if v < vmin vmin = v end
        #if v > vmax vmax = v end
        vmin = v < vmin ? v : vmin
        vmax = v > vmax ? v : vmax
    end
    return (vmin, vmax)
end
```
```julia
julia> @benchmark extrema5($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     267.636 ns (0.00% GC)
  median time:      267.638 ns (0.00% GC)
  mean time:        281.006 ns (0.00% GC)
  maximum time:     752.478 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     343
```

And a faster one

```julia
function extrema6(a::Vector{T}) where T<:Int64
    vmin = vmax = @inbounds a[1]
    for (i, v) in enumerate(a)
        vmin = v < vmin ? v : vmin
        vmax = v > vmax ? v : vmax
    end
    return (vmin, vmax)
end
```
```julia
julia> @benchmark extrema6($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     256.351 ns (0.00% GC)
  median time:      256.633 ns (0.00% GC)
  mean time:        269.742 ns (0.00% GC)
  maximum time:     616.575 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     362
```

`enumerate` seems to outperform `eachindex`, `iterate`, and `for v in a`.
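One caveat worth flagging (my note, not part of the benchmark above): `enumerate` yields `(index, value)` pairs, so the destructuring order matters. Writing `for (v, i) in enumerate(a)` would silently bind `v` to the index:

```julia
a = [10, 20, 30]

# `enumerate` yields (index, value) tuples, in that order.
for (i, v) in enumerate(a)
    @assert v == a[i]   # v is the element, i its 1-based index
end

first(enumerate(a))  # (1, 10)
```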

Now I'm curious about your last comment, that alignment makes starting at 1 better.
Watch this:

```julia
function extrema2(a::Vector{Int64})
    vmin = vmax = @inbounds a[1]
    for i = 2:length(a)
        v = @inbounds a[i]
        vmin = v < vmin ? v : vmin
        vmax = v > vmax ? v : vmax
    end
    return (vmin, vmax)
end
```
```julia
julia> @benchmark extrema2($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     272.443 ns (0.00% GC)
  median time:      273.068 ns (0.00% GC)
  mean time:        287.159 ns (0.00% GC)
  maximum time:     635.294 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     323
```

The same but starting at 1 takes a ~3.2x perf hit:

```julia
function extrema2(a::Vector{Int64})
    vmin = vmax = @inbounds a[1]
    for i = 1:length(a)
        v = @inbounds a[i]
        vmin = v < vmin ? v : vmin
        vmax = v > vmax ? v : vmax
    end
    return (vmin, vmax)
end
```
```julia
julia> @benchmark extrema2($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     885.436 ns (0.00% GC)
  median time:      890.909 ns (0.00% GC)
  mean time:        934.109 ns (0.00% GC)
  maximum time:     2.149 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     55
```

That looks like a memory alignment thing that might not be reproducible.
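Whether alignment is really the culprit could be probed (a rough sketch, not a definitive diagnostic) by checking the data pointer's alignment across fresh allocations:

```julia
# Probe the data-pointer alignment of a freshly allocated array. If 32- or
# 64-byte alignment varies between allocations, SIMD load speed (and hence
# this benchmark) may vary from run to run.
a = rand(1:1000, 1000)
addr = UInt(pointer(a))
println("address mod 16: ", addr % 16)
println("address mod 64: ", addr % 64)
```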