Strange performance characteristics and regression with tuple recursion

I have a custom iterator that produces the unique indices of an M-dimensional array with exchange symmetry (i.e., where the value is the same for any permutation of the indices). These are the same as the nonzero indices of an M-dimensional “triangular” array, i.e., the index tuples satisfying i1 ≤ i2 ≤ i3 ≤ ... ≤ iM. The iterator is written using tail recursion over tuples. Today I noticed that its performance varies strongly between Julia versions, with 1.6 and 1.7 in particular showing regressions. First, here’s an MWE of the code, including the test I use to benchmark it:

using BenchmarkTools

println("Julia version $VERSION")

struct TstIter{M} n::Int end

function Base.iterate(iter::TstIter{M}) where M
    I = ntuple(i->1,Val(M))
    I, I
end
function Base.iterate(iter::TstIter, state)
    valid, I = __inc(iter, state)
    return valid ? (I, I) : nothing
end

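# __inc advances the state tuple to the next index tuple with i1 ≤ i2 ≤ ... ≤ iM;
# it returns (valid, newstate), where valid == false signals the end of iteration.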
__inc(iter::TstIter, state::Tuple{Int}) = (state[1] < iter.n, (state[1]+1,))
function __inc(iter::TstIter, state::NTuple{N,Int}) where {N}
    state[1] < state[2] && return true, (state[1]+1, Base.tail(state)...)
    valid, I = __inc(iter, Base.tail(state))
    return valid, (1, I...)
end

function f(iter)
    s = 0
    for I in iter
        s += first(I)
    end
    s
end
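
# For a concrete picture of what the iterator yields (small example, not part of
# the benchmark): with M = 2 and n = 3 it visits
#   (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3)
# i.e. all index pairs with i1 ≤ i2 ≤ 3.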

n = 50
for M in 1:4
    iter = TstIter{M}(n)
    # check that number of elements is correct
    @assert count(x->true,iter) == binomial(n+M-1,M)

    print("M = $M: ")
    @btime f($iter)
end

Running this with the different versions of Julia that I have available (all on the same machine) gives:

Julia version 1.3.1
M = 1:   2.023 ns (0 allocations: 0 bytes)
M = 2:   629.188 ns (0 allocations: 0 bytes)
M = 3:   11.245 μs (0 allocations: 0 bytes)
M = 4:   169.320 μs (0 allocations: 0 bytes)

Julia version 1.4.0
M = 1:   2.022 ns (0 allocations: 0 bytes)
M = 2:   106.376 ns (0 allocations: 0 bytes)
M = 3:   2.792 μs (0 allocations: 0 bytes)
M = 4:   48.385 μs (0 allocations: 0 bytes)

Julia version 1.5.1
M = 1:   1.692 ns (0 allocations: 0 bytes)
M = 2:   108.645 ns (0 allocations: 0 bytes)
M = 3:   2.674 μs (0 allocations: 0 bytes)
M = 4:   51.156 μs (0 allocations: 0 bytes)

Julia version 1.6.1
M = 1:   1.693 ns (0 allocations: 0 bytes)
M = 2:   1.865 μs (0 allocations: 0 bytes)
M = 3:   2.463 μs (0 allocations: 0 bytes)
M = 4:   44.355 μs (0 allocations: 0 bytes)

Julia version 1.7.0-beta3.0
M = 1:   2.025 ns (0 allocations: 0 bytes)
M = 2:   1.868 μs (0 allocations: 0 bytes)
M = 3:   12.798 μs (0 allocations: 0 bytes)
M = 4:   207.063 μs (0 allocations: 0 bytes)

Some things to note: there was a big improvement from Julia 1.3.1 to 1.4.0, which was maintained in 1.5.1. Then, in 1.6.1, the M = 2 case (and only that one!?) showed a significant regression, going from ~110 ns to ~1.9 µs (almost as slow as M = 3), and in 1.7.0-beta3 the performance for M = 3 and M = 4 also regressed significantly.
I’ve tried quite a few variations of the code (on v1.6; I didn’t test extensively on v1.7), but did not find any way to get the v1.5 performance back on v1.6. For example, adding @inline didn’t help, nor did adding an explicit specialization of __inc for a tuple of 2 Ints (which is where the regression happened); see the sketch below.
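
For reference, such an explicit specialization looks roughly like this (a sketch that simply hand-unrolls the recursive method for two indices; it is equivalent to what the general method does for M = 2):

function __inc(iter::TstIter, state::NTuple{2,Int})
    # same logic as the general method, unrolled for two indices
    state[1] < state[2] && return true, (state[1]+1, state[2])
    return state[2] < iter.n, (1, state[2]+1)
end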

Long story short: Any ideas what’s going on, and/or suggestions on how to write this code to get the same performance on v1.6 or v1.7 as on v1.5? Am I doing something silly in the test itself? I realize this is a microbenchmark, but I’d like to know what’s going on regardless.
BTW, these timings are from a linux machine, but I also cross-checked on a macbook and found the same behaviour.

For context: the code is a simplification of an iterator from SymArrays.jl, which provides a type to (hopefully) efficiently represent and use arrays with exchange symmetries between (some) groups of indices.

Some hints here:
The regression in the M = 2 case is caused by missing vectorization.
In Julia 1.5.2, @code_llvm f(iter) gives (showing the vectorized inner loop):

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i65 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.ind = phi <4 x i64> [ <i64 -1, i64 0, i64 1, i64 2>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  %vec.ind36 = phi <4 x i65> [ <i65 0, i65 1, i65 2, i65 3>, %vector.ph ], [ %vec.ind.next39, %vector.body ]
  %vec.ind40 = phi <4 x i64> [ <i64 1, i64 3, i64 5, i64 7>, %vector.ph ], [ %vec.ind.next43, %vector.body ]
  %vec.phi = phi <4 x i64> [ zeroinitializer, %vector.ph ], [ %18, %vector.body ]
  %vec.phi44 = phi <4 x i64> [ zeroinitializer, %vector.ph ], [ %19, %vector.body ]
  %step.add = add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4>
  %step.add37 = add <4 x i65> %vec.ind36, <i65 4, i65 4, i65 4, i65 4>
  %step.add41 = add <4 x i64> %vec.ind40, <i64 8, i64 8, i64 8, i64 8>
  %8 = zext <4 x i64> %vec.ind to <4 x i65>
  %9 = zext <4 x i64> %step.add to <4 x i65>
  %10 = mul <4 x i65> %vec.ind36, %8
  %11 = mul <4 x i65> %step.add37, %9
  %12 = lshr <4 x i65> %10, <i65 1, i65 1, i65 1, i65 1>
  %13 = lshr <4 x i65> %11, <i65 1, i65 1, i65 1, i65 1>
  %14 = trunc <4 x i65> %12 to <4 x i64>
  %15 = trunc <4 x i65> %13 to <4 x i64>
  %16 = add <4 x i64> %vec.phi, %vec.ind40
  %17 = add <4 x i64> %vec.phi44, %step.add41
  %18 = add <4 x i64> %16, %14
  %19 = add <4 x i64> %17, %15
  %index.next = add i65 %index, 8
  %vec.ind.next = add <4 x i64> %step.add, <i64 4, i64 4, i64 4, i64 4>
  %vec.ind.next39 = add <4 x i65> %step.add37, <i65 4, i65 4, i65 4, i65 4>
  %vec.ind.next43 = add <4 x i64> %step.add41, <i64 8, i64 8, i64 8, i64 8>
  %20 = icmp eq i65 %index.next, %n.vec
  br i1 %20, label %middle.block, label %vector.body

In Julia 1.6.2:

julia> @code_llvm debuginfo=:none f(iter)
define i64 @julia_f_2599([1 x i64]* nocapture nonnull readonly align 8 dereferenceable(8) %0) {
top:
  %1 = getelementptr inbounds [1 x i64], [1 x i64]* %0, i64 0, i64 0
  %2 = load i64, i64* %1, align 8
  br label %L2

L2:                                               ; preds = %L2, %top
  %value_phi = phi i64 [ 1, %top ], [ %value_phi8, %L2 ]
  %value_phi2 = phi i64 [ 1, %top ], [ %value_phi10, %L2 ]
  %value_phi6 = phi i64 [ 0, %top ], [ %3, %L2 ]
  %3 = add i64 %value_phi6, %value_phi
  %.not = icmp slt i64 %value_phi, %value_phi2
  %4 = add i64 %value_phi, 1
  %5 = icmp slt i64 %value_phi2, %2
  %value_phi7 = or i1 %.not, %5
  %value_phi8 = select i1 %.not, i64 %4, i64 1
  %not..not = xor i1 %.not, true
  %6 = zext i1 %not..not to i64
  %value_phi10 = add i64 %value_phi2, %6
  br i1 %value_phi7, label %L2, label %L38

L38:                                              ; preds = %L2
  ret i64 %3
}

For M = 4 the vectorization still happens, so there is no regression there.
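
A quick way to check this for each M, without reading the whole dump, is to search the optimized IR for vector types (a rough sketch of a hypothetical helper; the regex is only a heuristic):

using InteractiveUtils

# Heuristic: consider the loop vectorized if the optimized IR for f(iter)
# contains vector types such as "<4 x i64>".
function is_vectorized(iter)
    ir = sprint(io -> code_llvm(io, f, (typeof(iter),); debuginfo=:none))
    return occursin(r"<\d+ x i64>", ir)
end

is_vectorized(TstIter{2}(50))   # false on 1.6.2, true on 1.5.2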

Thanks! Any idea/suggestion on how to get it to vectorize again?
On a hunch, I tried replacing the branch in iterate by ifelse, which evaluates both of its arguments and typically compiles to a select rather than a branch:

function Base.iterate(iter::TstIter, state)
    valid, I = __inc(iter, state)
    ifelse(valid, (I, I), nothing)
end

This fixes the regression for M=2 in 1.6 (I updated from 1.6.1 to 1.6.2 since yesterday, but that didn’t change anything), but gives even worse performance on 1.7.0-beta3:

Julia version 1.6.2
M = 1:   2.025 ns (0 allocations: 0 bytes)
M = 2:   94.756 ns (0 allocations: 0 bytes)
M = 3:   2.456 μs (0 allocations: 0 bytes)
M = 4:   44.358 μs (0 allocations: 0 bytes)

Julia version 1.7.0-beta3.0
M = 1:   34.362 ns (0 allocations: 0 bytes)
M = 2:   2.031 μs (0 allocations: 0 bytes)
M = 3:   25.236 μs (0 allocations: 0 bytes)
M = 4:   347.549 μs (0 allocations: 0 bytes)

For anyone interested in the LLVM side, the problem is caused by the (second) “Simplify the CFG” pass.
Before this pass, Julia 1.6.2 and 1.5.2 generate almost the same IR (not identical, but differing only by a permutation). At that point the CFG has 6 blocks:

*** IR Dump After Combine redundant instructions ***
define i64 @julia_f_34([1 x i64] addrspace(11)* nocapture nonnull readonly align 8 dereferenceable(8) %0) !dbg !5 {
top:
  %1 = call {}*** @julia.ptls_states()
  br label %L2, !dbg !7

L2:                                               ; preds = %L18, %top
  %value_phi = phi i64 [ 1, %top ], [ %value_phi8, %L18 ]
  %value_phi2 = phi i64 [ 1, %top ], [ %value_phi10, %L18 ]
  %value_phi6 = phi i64 [ 0, %top ], [ %2, %L18 ]
  %2 = add i64 %value_phi6, %value_phi, !dbg !8
  %.not = icmp slt i64 %value_phi, %value_phi2, !dbg !12
  br i1 %.not, label %L12, label %L14, !dbg !14

L12:                                              ; preds = %L2
  %3 = add i64 %value_phi, 1, !dbg !20
  br label %L18, !dbg !14

L14:                                              ; preds = %L2
  %4 = getelementptr inbounds [1 x i64], [1 x i64] addrspace(11)* %0, i64 0, i64 0, !dbg !21
  %5 = load i64, i64 addrspace(11)* %4, align 8, !dbg !28, !tbaa !29, !invariant.load !4
  %6 = icmp slt i64 %value_phi2, %5, !dbg !28
  %7 = add i64 %value_phi2, 1, !dbg !33
  br label %L18, !dbg !34

L18:                                              ; preds = %L14, %L12
  %value_phi7 = phi i1 [ true, %L12 ], [ %6, %L14 ]
  %value_phi8 = phi i64 [ %3, %L12 ], [ 1, %L14 ]
  %value_phi10 = phi i64 [ %value_phi2, %L12 ], [ %7, %L14 ]
  br i1 %value_phi7, label %L2, label %L38, !dbg !11

L38:                                              ; preds = %L18
  ret i64 %2, !dbg !35
}


After this pass, Julia 1.6.2 merges blocks L12 and L18 into L14 and turns all the phi nodes in L18 into selects (leaving only 3 blocks), while Julia 1.5.2 does nothing here. The IR after this pass in Julia 1.6.2:

*** IR Dump After Simplify the CFG ***
define i64 @julia_f_34([1 x i64] addrspace(11)* nocapture nonnull readonly align 8 dereferenceable(8) %0) !dbg !5 {
top:
  %1 = call {}*** @julia.ptls_states()
  br label %L2, !dbg !7

L2:                                               ; preds = %L2, %top
  %value_phi = phi i64 [ 1, %top ], [ %value_phi8, %L2 ]
  %value_phi2 = phi i64 [ 1, %top ], [ %value_phi10, %L2 ]
  %value_phi6 = phi i64 [ 0, %top ], [ %2, %L2 ]
  %2 = add i64 %value_phi6, %value_phi, !dbg !8
  %.not = icmp slt i64 %value_phi, %value_phi2, !dbg !12
  %3 = add i64 %value_phi, 1, !dbg !14
  %4 = getelementptr inbounds [1 x i64], [1 x i64] addrspace(11)* %0, i64 0, i64 0, !dbg !14
  %5 = load i64, i64 addrspace(11)* %4, align 8, !dbg !14
  %6 = icmp slt i64 %value_phi2, %5, !dbg !14
  %7 = add i64 %value_phi2, 1, !dbg !14
  %value_phi7 = select i1 %.not, i1 true, i1 %6, !dbg !14
  %value_phi8 = select i1 %.not, i64 %3, i64 1, !dbg !14
  %value_phi10 = select i1 %.not, i64 %value_phi2, i64 %7, !dbg !14
  br i1 %value_phi7, label %L2, label %L38, !dbg !11

L38:                                              ; preds = %L2
  ret i64 %2, !dbg !20
}

However, these selects ruin the subsequent vectorization pass. In Julia 1.5.2, by contrast, the unmerged L18 block gets eliminated by jump threading, and the loop-rotation pass transforms the loop into a surprisingly simple form with no selects:

*** IR Dump After Rotate Loops ***
; Preheader:
L2.outer:                                         ; preds = %L18, %top
  %value_phi.ph = phi i64 [ %.lcssa, %L18 ], [ 0, %top ]
  %value_phi3.ph = phi i64 [ %8, %L18 ], [ 1, %top ]
  br label %L2, !dbg !21

; Loop:
L2:                                               ; preds = %L2, %L2.outer
  %value_phi = phi i64 [ %3, %L2 ], [ %value_phi.ph, %L2.outer ]
  %value_phi1 = phi i64 [ %5, %L2 ], [ 1, %L2.outer ]
  %3 = add i64 %value_phi1, %value_phi, !dbg !22
  %4 = icmp slt i64 %value_phi1, %value_phi3.ph, !dbg !25
  %5 = add i64 %value_phi1, 1, !dbg !27
  br i1 %4, label %L2, label %L18, !dbg !21

; Exit blocks
L18:                                              ; preds = %L2
  %.lcssa = phi i64 [ %3, %L2 ], !dbg !22
  %6 = load i64, i64 addrspace(11)* %2, align 8, !dbg !28, !tbaa !29, !invariant.load !4
  %7 = icmp slt i64 %value_phi3.ph, %6, !dbg !28
  %8 = add i64 %value_phi3.ph, 1, !dbg !33
  br i1 %7, label %L2.outer, label %L38, !dbg !19

I guess Julia or LLVM changed some parameters of this pass, so it has become much more aggressive in newer Julia, but that defeats the later vectorization, which leads to the bad performance.
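
(In case anyone wants to reproduce these per-pass dumps: something like the following should work. This is an assumption on my part, since JULIA_LLVM_ARGS is only read at startup and -print-after-all is an LLVM flag rather than a Julia one; "mwe.jl" stands for the MWE from the top of the thread saved to a file.)

# Sketch: relaunch Julia with LLVM's per-pass IR printing enabled and compile the MWE there.
# Assumptions: JULIA_LLVM_ARGS is forwarded to LLVM at startup, the dumps go to stderr,
# and addenv requires Julia >= 1.6.
cmd = addenv(`$(Base.julia_cmd()) --startup-file=no mwe.jl`,
             "JULIA_LLVM_ARGS" => "-print-after-all")
run(pipeline(cmd; stderr = "passes.ll"))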

I don’t know how to get rid of the regression; it’s a nontrivial interaction of multiple optimization passes. I have tried rewriting the code, but no luck. I think currently the best you can do is to stick to an older version of Julia (or use ifelse as a workaround in Julia 1.6) and wait for a fix. Have you opened an issue on GitHub?

Awesome detective work, thanks a lot! I opened an issue this morning (Performance regression with tuple tail recursion in v1.7 · Issue #41637 · JuliaLang/julia · GitHub), pointing at this thread.