I have a custom iterator that produces the unique indices of an M-dimensional array with exchange symmetry (i.e., where the value is the same under any permutation of the indices). These are the same as the nonzero indices of an M-dimensional “triangular” array, i.e., the index tuples satisfying i1 ≤ i2 ≤ i3 ≤ ... ≤ iM. The iterator is written using tail recursion over tuples. Today I noticed that its performance varies strongly across Julia versions, with 1.6 and 1.7 in particular showing regressions. First, here’s an MWE of the code, including the test I use to benchmark it, followed by a small example of what it produces:
using BenchmarkTools

println("Julia version $VERSION")

struct TstIter{M}
    n::Int
end

# first iteration: start at (1, 1, ..., 1)
function Base.iterate(iter::TstIter{M}) where M
    I = ntuple(i->1, Val(M))
    I, I
end

function Base.iterate(iter::TstIter, state)
    valid, I = __inc(iter, state)
    return valid ? (I, I) : nothing
end

# advance to the next tuple with i1 ≤ i2 ≤ ... ≤ iM (tail recursion over tuples);
# the returned Bool indicates whether the new tuple is still in range
__inc(iter::TstIter, state::Tuple{Int}) = (state[1] < iter.n, (state[1]+1,))
function __inc(iter::TstIter, state::NTuple{N,Int}) where {N}
    state[1] < state[2] && return true, (state[1]+1, Base.tail(state)...)
    valid, I = __inc(iter, Base.tail(state))
    return valid, (1, I...)
end

function f(iter)
    s = 0
    for I in iter
        s += first(I)
    end
    s
end

n = 50
for M in 1:4
    iter = TstIter{M}(n)
    # check that the number of elements is correct
    @assert count(x->true, iter) == binomial(n+M-1, M)
    print("M = $M: ")
    @btime f($iter)
end
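To make the intended behaviour concrete, here is what the iterator yields for a tiny case (M = 2, n = 3); this is just an illustration of the ordering and is separate from the benchmark:

# iterate a small instance and print each index tuple
for I in TstIter{2}(3)
    println(I)
end
# prints:
# (1, 1)
# (1, 2)
# (2, 2)
# (1, 3)
# (2, 3)
# (3, 3)

That is, the first index runs fastest and every tuple satisfies i1 ≤ i2, for a total of binomial(3+2-1, 2) = 6 elements.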
Running this with the different versions of Julia that I have available (all on the same machine) gives:
Julia version 1.3.1
M = 1: 2.023 ns (0 allocations: 0 bytes)
M = 2: 629.188 ns (0 allocations: 0 bytes)
M = 3: 11.245 μs (0 allocations: 0 bytes)
M = 4: 169.320 μs (0 allocations: 0 bytes)
Julia version 1.4.0
M = 1: 2.022 ns (0 allocations: 0 bytes)
M = 2: 106.376 ns (0 allocations: 0 bytes)
M = 3: 2.792 μs (0 allocations: 0 bytes)
M = 4: 48.385 μs (0 allocations: 0 bytes)
Julia version 1.5.1
M = 1: 1.692 ns (0 allocations: 0 bytes)
M = 2: 108.645 ns (0 allocations: 0 bytes)
M = 3: 2.674 μs (0 allocations: 0 bytes)
M = 4: 51.156 μs (0 allocations: 0 bytes)
Julia version 1.6.1
M = 1: 1.693 ns (0 allocations: 0 bytes)
M = 2: 1.865 μs (0 allocations: 0 bytes)
M = 3: 2.463 μs (0 allocations: 0 bytes)
M = 4: 44.355 μs (0 allocations: 0 bytes)
Julia version 1.7.0-beta3.0
M = 1: 2.025 ns (0 allocations: 0 bytes)
M = 2: 1.868 μs (0 allocations: 0 bytes)
M = 3: 12.798 μs (0 allocations: 0 bytes)
M = 4: 207.063 μs (0 allocations: 0 bytes)
Some things to note: there was a big improvement from Julia 1.3.1 to 1.4.0, which was maintained in 1.5.1. Then, in 1.6.1, the M = 2 case (and only that one!?) regressed significantly, going from ~110 ns to ~1.9 µs (almost as slow as M = 3). In 1.7.0-beta3, the performance for M = 3 and M = 4 also regressed significantly.
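In case it helps with digging in, the state-advancing step for the problematic M = 2 case can be inspected along these lines (just a pointer for investigating; the timings above don’t depend on this):

using InteractiveUtils  # loaded by default in the REPL, needed in a script
iter2 = TstIter{2}(50)
@code_warntype iterate(iter2, (1, 1))              # check inference of one iteration step
@code_llvm debuginfo=:none iterate(iter2, (1, 1))  # compare generated code across versions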
I’ve tried quite a few different versions of the code on v1.6 (I didn’t test as extensively with v1.7), but did not find any way to get the v1.5 performance back on v1.6: e.g., adding @inline annotations didn’t help, nor did adding an explicit specialization of __inc for a tuple of two Ints (which is where the regression happened).
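For reference, the explicit two-index specialization I mean is roughly the following (a sketch of the idea; the exact variant I benchmarked may have differed slightly):

# unrolled version of __inc for M = 2; behaves like the recursive definition above
function __inc(iter::TstIter, state::NTuple{2,Int})
    state[1] < state[2] && return true, (state[1]+1, state[2])
    return state[2] < iter.n, (1, state[2]+1)
end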
Long story short: Any ideas what’s going on, and/or suggestions on how to write this code to get the same performance on v1.6 or v1.7 as on v1.5? Am I doing something silly in the test itself? I realize this is a microbenchmark, but I’d like to know what’s going on regardless.
BTW, these timings are from a Linux machine, but I also cross-checked on a MacBook and found the same behaviour.
For context: the code is a simplification of an iterator from SymArrays.jl, which provides a type to (hopefully) efficiently represent and use arrays with exchange symmetries between (some) groups of indices.