# Strange performance characteristics and regression with tuple recursion

I have a custom iterator that produces the unique indices of an M-dimensional array with exchange symmetry (i.e., where the value is the same for any permutation of the indices). These are the same as the nonzero indices of an M-dimensional “triangular” array, i.e., the index tuples fulfilling `i1 ≤ i2 ≤ i3 ≤ ... ≤ iM`. The iterator is written using tail recursion over tuples. Today I noticed that its performance varies strongly between versions of Julia, with 1.6 and 1.7 in particular showing regressions. First, here’s an MWE of the code, including the test I use to benchmark it:

```julia
using BenchmarkTools

println("Julia version $VERSION")

struct TstIter{M}
    n::Int
end

function Base.iterate(iter::TstIter{M}) where M
    I = ntuple(i -> 1, Val(M))
    I, I
end
function Base.iterate(iter::TstIter, state)
    valid, I = __inc(iter, state)
    return valid ? (I, I) : nothing
end

__inc(iter::TstIter, state::Tuple{Int}) = (state[1] < iter.n, (state[1] + 1,))
function __inc(iter::TstIter, state::NTuple{N,Int}) where {N}
    state[1] < state[2] && return true, (state[1] + 1, Base.tail(state)...)
    valid, I = __inc(iter, Base.tail(state))
    return valid, (1, I...)
end

function f(iter)
    s = 0
    for I in iter
        s += first(I)
    end
    s
end

n = 50
for M in 1:4
    iter = TstIter{M}(n)
    # check that the number of elements is correct
    @assert count(x -> true, iter) == binomial(n + M - 1, M)

    print("M = $M: ")
    @btime f($iter)
end
```
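As a cross-check on what the iterator is supposed to produce, the same index set can be generated by brute force, filtering all M-tuples for sortedness. A sketch (`ref_indices` is a name I made up for this comparison; it is not part of the MWE):

```julia
# Brute-force reference (hypothetical helper): all M-tuples with
# 1 ≤ i1 ≤ i2 ≤ ... ≤ iM ≤ n, obtained by filtering the full n^M product.
# Far slower than the iterator, but obviously correct.
ref_indices(n, M) = [I for I in Iterators.product(ntuple(_ -> 1:n, M)...) if issorted(I)]
```

For small `n` and `M` one can then additionally check that every tuple the iterator yields is sorted, e.g. `@assert all(issorted, TstIter{3}(4))`, and compare the element count against `length(ref_indices(4, 3))`.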

Running this with the different versions of Julia that I have available (all on the same machine) gives:

```
Julia version 1.3.1
M = 1:   2.023 ns (0 allocations: 0 bytes)
M = 2:   629.188 ns (0 allocations: 0 bytes)
M = 3:   11.245 μs (0 allocations: 0 bytes)
M = 4:   169.320 μs (0 allocations: 0 bytes)

Julia version 1.4.0
M = 1:   2.022 ns (0 allocations: 0 bytes)
M = 2:   106.376 ns (0 allocations: 0 bytes)
M = 3:   2.792 μs (0 allocations: 0 bytes)
M = 4:   48.385 μs (0 allocations: 0 bytes)

Julia version 1.5.1
M = 1:   1.692 ns (0 allocations: 0 bytes)
M = 2:   108.645 ns (0 allocations: 0 bytes)
M = 3:   2.674 μs (0 allocations: 0 bytes)
M = 4:   51.156 μs (0 allocations: 0 bytes)

Julia version 1.6.1
M = 1:   1.693 ns (0 allocations: 0 bytes)
M = 2:   1.865 μs (0 allocations: 0 bytes)
M = 3:   2.463 μs (0 allocations: 0 bytes)
M = 4:   44.355 μs (0 allocations: 0 bytes)

Julia version 1.7.0-beta3.0
M = 1:   2.025 ns (0 allocations: 0 bytes)
M = 2:   1.868 μs (0 allocations: 0 bytes)
M = 3:   12.798 μs (0 allocations: 0 bytes)
M = 4:   207.063 μs (0 allocations: 0 bytes)
```

Some things to note: there was a big improvement from Julia 1.3.1 to 1.4.0, which was maintained in 1.5.1. Then, in 1.6.1, the case M=2 (and only that one!?) shows a significant regression, going from ~110 ns to ~1.9 µs (almost as slow as M=3), and in 1.7.0-beta3, the performance for M=3 and M=4 also regressed significantly.
I’ve tried quite a few variations of the code (on v1.6; I didn’t do extensive testing with v1.7), but did not find any way to get the performance of v1.5 back on v1.6. E.g., adding `@inline`s didn’t help, nor did adding an explicit specialization of `__inc` for a tuple of 2 Ints (which is where the regression happened).

Long story short: Any ideas what’s going on, and/or suggestions on how to write this code to get the same performance on v1.6 or v1.7 as on v1.5? Am I doing something silly in the test itself? I realize this is a microbenchmark, but I’d like to know what’s going on regardless.
BTW, these timings are from a linux machine, but I also cross-checked on a macbook and found the same behaviour.

For context: the code is a simplification of an iterator from SymArrays.jl, which provides a type to (hopefully) efficiently represent and use arrays with exchange symmetries between (some) groups of indices.


Some hints here:
The regression in the M = 2 case is caused by missing vectorization.
In Julia 1.5.2, `@code_llvm f(iter)` gives

```llvm
vector.body:                                      ; preds = %vector.body, %vector.ph
%index = phi i65 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.ind = phi <4 x i64> [ <i64 -1, i64 0, i64 1, i64 2>, %vector.ph ], [ %vec.ind.next, %vector.body ]
%vec.ind36 = phi <4 x i65> [ <i65 0, i65 1, i65 2, i65 3>, %vector.ph ], [ %vec.ind.next39, %vector.body ]
%vec.ind40 = phi <4 x i64> [ <i64 1, i64 3, i64 5, i64 7>, %vector.ph ], [ %vec.ind.next43, %vector.body ]
%vec.phi = phi <4 x i64> [ zeroinitializer, %vector.ph ], [ %18, %vector.body ]
%vec.phi44 = phi <4 x i64> [ zeroinitializer, %vector.ph ], [ %19, %vector.body ]
%step.add = add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4>
%step.add37 = add <4 x i65> %vec.ind36, <i65 4, i65 4, i65 4, i65 4>
%step.add41 = add <4 x i64> %vec.ind40, <i64 8, i64 8, i64 8, i64 8>
%8 = zext <4 x i64> %vec.ind to <4 x i65>
%9 = zext <4 x i64> %step.add to <4 x i65>
%10 = mul <4 x i65> %vec.ind36, %8
%11 = mul <4 x i65> %step.add37, %9
%12 = lshr <4 x i65> %10, <i65 1, i65 1, i65 1, i65 1>
%13 = lshr <4 x i65> %11, <i65 1, i65 1, i65 1, i65 1>
%14 = trunc <4 x i65> %12 to <4 x i64>
%15 = trunc <4 x i65> %13 to <4 x i64>
%16 = add <4 x i64> %vec.phi, %vec.ind40
%17 = add <4 x i64> %vec.phi44, %step.add41
%18 = add <4 x i64> %16, %14
%19 = add <4 x i64> %17, %15
%index.next = add i65 %index, 8
%vec.ind.next = add <4 x i64> %step.add, <i64 4, i64 4, i64 4, i64 4>
%vec.ind.next39 = add <4 x i65> %step.add37, <i65 4, i65 4, i65 4, i65 4>
%vec.ind.next43 = add <4 x i64> %step.add41, <i64 8, i64 8, i64 8, i64 8>
%20 = icmp eq i65 %index.next, %n.vec
br i1 %20, label %middle.block, label %vector.body
```

In Julia 1.6.2:

```
julia> @code_llvm debuginfo=:none f(iter)
define i64 @julia_f_2599([1 x i64]* nocapture nonnull readonly align 8 dereferenceable(8) %0) {
top:
%1 = getelementptr inbounds [1 x i64], [1 x i64]* %0, i64 0, i64 0
%2 = load i64, i64* %1, align 8
br label %L2

L2:                                               ; preds = %L2, %top
%value_phi = phi i64 [ 1, %top ], [ %value_phi8, %L2 ]
%value_phi2 = phi i64 [ 1, %top ], [ %value_phi10, %L2 ]
%value_phi6 = phi i64 [ 0, %top ], [ %3, %L2 ]
%3 = add i64 %value_phi6, %value_phi
%.not = icmp slt i64 %value_phi, %value_phi2
%4 = add i64 %value_phi, 1
%5 = icmp slt i64 %value_phi2, %2
%value_phi7 = or i1 %.not, %5
%value_phi8 = select i1 %.not, i64 %4, i64 1
%not..not = xor i1 %.not, true
%6 = zext i1 %not..not to i64
%value_phi10 = add i64 %value_phi2, %6
br i1 %value_phi7, label %L2, label %L38

L38:                                              ; preds = %L2
ret i64 %3
}
```

For M=4, the vectorization still happens, so there is no regression there.
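To spot this kind of thing without reading full IR dumps, one can grep the optimized IR for LLVM's `vector.body` label. A small sketch (`is_vectorized` is my own name for this helper, and the heuristic only detects loop vectorization, not SLP vectorization):

```julia
using InteractiveUtils  # provides code_llvm

# Heuristic check: capture the optimized LLVM IR for a given call signature
# and look for the "vector.body" label that the loop vectorizer emits.
function is_vectorized(f, args...)
    buf = IOBuffer()
    code_llvm(buf, f, Base.typesof(args...); debuginfo = :none)
    return occursin("vector.body", String(take!(buf)))
end
```

With the MWE above loaded, `is_vectorized(f, TstIter{2}(50))` should then mirror the timings: `true` on versions where the loop vectorizes, `false` where it doesn't.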

Thanks! Any ideas or suggestions on how to get it to vectorize again?
On a hunch, I tried replacing a branch by `ifelse`:

```julia
function Base.iterate(iter::TstIter, state)
    valid, I = __inc(iter, state)
    ifelse(valid, (I, I), nothing)
end
```

This fixes the regression for M=2 in 1.6 (I updated from 1.6.1 to 1.6.2 since yesterday, but that didn’t change anything), but gives even worse performance on 1.7.0-beta3:

```
Julia version 1.6.2
M = 1:   2.025 ns (0 allocations: 0 bytes)
M = 2:   94.756 ns (0 allocations: 0 bytes)
M = 3:   2.456 μs (0 allocations: 0 bytes)
M = 4:   44.358 μs (0 allocations: 0 bytes)

Julia version 1.7.0-beta3.0
M = 1:   34.362 ns (0 allocations: 0 bytes)
M = 2:   2.031 μs (0 allocations: 0 bytes)
M = 3:   25.236 μs (0 allocations: 0 bytes)
M = 4:   347.549 μs (0 allocations: 0 bytes)
```
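Given those numbers, one stopgap is to pick the `iterate` definition at load time depending on the running Julia version. A self-contained sketch (`TstIter2`/`__inc2` are renamed copies of the MWE definitions so the snippet stands alone; this just combines the two variants from this thread, it is not a real fix):

```julia
# Renamed copy of the MWE iterator so this snippet is self-contained.
struct TstIter2{M}
    n::Int
end

function Base.iterate(iter::TstIter2{M}) where {M}
    I = ntuple(i -> 1, Val(M))
    return I, I
end

__inc2(iter::TstIter2, state::Tuple{Int}) = (state[1] < iter.n, (state[1] + 1,))
function __inc2(iter::TstIter2, state::NTuple{N,Int}) where {N}
    state[1] < state[2] && return true, (state[1] + 1, Base.tail(state)...)
    valid, I = __inc2(iter, Base.tail(state))
    return valid, (1, I...)
end

if v"1.6" <= VERSION < v"1.7-"
    # Branch-free variant: restored vectorization on 1.6.x in the timings above.
    function Base.iterate(iter::TstIter2, state)
        valid, I = __inc2(iter, state)
        return ifelse(valid, (I, I), nothing)
    end
else
    # Plain branch: best on <= 1.5, and the less-bad option on 1.7.0-beta3.
    function Base.iterate(iter::TstIter2, state)
        valid, I = __inc2(iter, state)
        return valid ? (I, I) : nothing
    end
end
```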

For anyone interested in the LLVM side: the problem is caused by the (second) CFG simplification pass.
Before this pass, Julia 1.6.2 and 1.5.2 generate almost the same IR (not identical, but differing only by a permutation). At that point the CFG has 6 blocks:

```llvm
*** IR Dump After Combine redundant instructions ***
define i64 @julia_f_34([1 x i64] addrspace(11)* nocapture nonnull readonly align 8 dereferenceable(8) %0) !dbg !5 {
top:
%1 = call {}*** @julia.ptls_states()
br label %L2, !dbg !7

L2:                                               ; preds = %L18, %top
%value_phi = phi i64 [ 1, %top ], [ %value_phi8, %L18 ]
%value_phi2 = phi i64 [ 1, %top ], [ %value_phi10, %L18 ]
%value_phi6 = phi i64 [ 0, %top ], [ %2, %L18 ]
%2 = add i64 %value_phi6, %value_phi, !dbg !8
%.not = icmp slt i64 %value_phi, %value_phi2, !dbg !12
br i1 %.not, label %L12, label %L14, !dbg !14

L12:                                              ; preds = %L2
%3 = add i64 %value_phi, 1, !dbg !20
br label %L18, !dbg !14

L14:                                              ; preds = %L2
%4 = getelementptr inbounds [1 x i64], [1 x i64] addrspace(11)* %0, i64 0, i64 0, !dbg !21
%5 = load i64, i64 addrspace(11)* %4, align 8, !dbg !21
%6 = icmp slt i64 %value_phi2, %5, !dbg !28
%7 = add i64 %value_phi2, 1, !dbg !33
br label %L18, !dbg !34

L18:                                              ; preds = %L14, %L12
%value_phi7 = phi i1 [ true, %L12 ], [ %6, %L14 ]
%value_phi8 = phi i64 [ %3, %L12 ], [ 1, %L14 ]
%value_phi10 = phi i64 [ %value_phi2, %L12 ], [ %7, %L14 ]
br i1 %value_phi7, label %L2, label %L38, !dbg !11

L38:                                              ; preds = %L18
ret i64 %2, !dbg !35
}
```

After this pass, Julia 1.6.2 merges blocks L12 and L18 into L14 and turns all the phi nodes in L18 into `select`s (leaving only 3 blocks), while Julia 1.5.2 does nothing. The IR after this pass in Julia 1.6.2:

```llvm
*** IR Dump After Simplify the CFG ***
define i64 @julia_f_34([1 x i64] addrspace(11)* nocapture nonnull readonly align 8 dereferenceable(8) %0) !dbg !5 {
top:
%1 = call {}*** @julia.ptls_states()
br label %L2, !dbg !7

L2:                                               ; preds = %L2, %top
%value_phi = phi i64 [ 1, %top ], [ %value_phi8, %L2 ]
%value_phi2 = phi i64 [ 1, %top ], [ %value_phi10, %L2 ]
%value_phi6 = phi i64 [ 0, %top ], [ %2, %L2 ]
%2 = add i64 %value_phi6, %value_phi, !dbg !8
%.not = icmp slt i64 %value_phi, %value_phi2, !dbg !12
%3 = add i64 %value_phi, 1, !dbg !14
%4 = getelementptr inbounds [1 x i64], [1 x i64] addrspace(11)* %0, i64 0, i64 0, !dbg !14
%5 = load i64, i64 addrspace(11)* %4, align 8, !dbg !14
%6 = icmp slt i64 %value_phi2, %5, !dbg !14
%7 = add i64 %value_phi2, 1, !dbg !14
%value_phi7 = select i1 %.not, i1 true, i1 %6, !dbg !14
%value_phi8 = select i1 %.not, i64 %3, i64 1, !dbg !14
%value_phi10 = select i1 %.not, i64 %value_phi2, i64 %7, !dbg !14
br i1 %value_phi7, label %L2, label %L38, !dbg !11

L38:                                              ; preds = %L2
ret i64 %2, !dbg !20
}
```

However, these `select`s ruin the subsequent vectorization pass. In Julia 1.5.2, by contrast, the unmerged L18 block gets eliminated by jump threading, and the loop rotation pass transforms the loop into a surprisingly simple form with no selects:

```llvm
*** IR Dump After Rotate Loops ***
L2.outer:                                         ; preds = %L18, %top
%value_phi.ph = phi i64 [ %.lcssa, %L18 ], [ 0, %top ]
%value_phi3.ph = phi i64 [ %8, %L18 ], [ 1, %top ]
br label %L2, !dbg !21

; Loop:
L2:                                               ; preds = %L2, %L2.outer
%value_phi = phi i64 [ %3, %L2 ], [ %value_phi.ph, %L2.outer ]
%value_phi1 = phi i64 [ %5, %L2 ], [ 1, %L2.outer ]
%3 = add i64 %value_phi1, %value_phi, !dbg !22
%4 = icmp slt i64 %value_phi1, %value_phi3.ph, !dbg !25
%5 = add i64 %value_phi1, 1, !dbg !27
br i1 %4, label %L2, label %L18, !dbg !21

; Exit blocks
L18:                                              ; preds = %L2
%.lcssa = phi i64 [ %3, %L2 ], !dbg !22
```

I don’t know how to get rid of the regression. It’s an interaction of multiple optimization passes, which is nontrivial. I have tried rewriting the code, but had no luck. I think currently the best you can do is to stick with an older version of Julia (or use `ifelse` as a workaround on Julia 1.6) and wait for a fix. Have you opened an issue on GitHub?