Performance of recursively building tuples

I was reading through the source code for broadcasting and CartesianIndices, and I noticed that many methods operate on tuples recursively. That struck me as odd, since iteration seemed more intuitive, but I figured there might be some sort of tail-call optimization going on.

So I wrote a short script to compare the tail-recursive way and an iterative way.

# tail-recursive way
f(x) = (a(x[1]), f(Base.tail(x))...)
f(x::Tuple{Any}) = (a(x[1]),)  # base case returns a 1-tuple so f is type-consistent

# iterative way
g(x) = Tuple(a(e) for e in x)

# operation on tuple element
a(e) = 2*e

x = Tuple(1:1000) # the input tuple

@time f(x) # 14.826541 seconds (9.32 M allocations: 447.837 MiB, 5.50% gc time)
@time f(x) # 0.198740 seconds (842.29 k allocations: 35.926 MiB, 3.62% gc time)
@time g(x) # 0.031490 seconds (74.45 k allocations: 3.950 MiB)
@time g(x) # 0.000074 seconds (753 allocations: 43.484 KiB)

It appears that the iterative way is actually faster here. Can someone explain this result, and why recursion is used so often to build tuples?

Well, the tuples in question are the dimensions of arrays, which in practice have length around 5, and the optimizer handles that fine. We’ve been planning to add optimizations for long tuples, but that hasn’t been a priority yet.


So are relatively short tuples (say, <10 elements) optimized specially, or do the performance hits only start adding up at larger lengths?

Bit of both. There are limits in the compiler that penalize extra long tuples.
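To make that concrete, here’s a small sketch (the `dbl` helper is my own, not from the thread) showing that for a short tuple the recursive build is fully unrolled and inference produces a concrete return type:

```julia
# Hypothetical helper: recursively double each element of a tuple,
# in the same style as f from the original post.
dbl(x::Tuple{}) = ()
dbl(x::Tuple) = (2 * first(x), dbl(Base.tail(x))...)

dbl((1, 2, 3))  # -> (2, 4, 6)

# For a short concrete tuple type, inference resolves the whole
# recursion down to a concrete NTuple:
Base.return_types(dbl, (NTuple{5, Int},))  # -> [NTuple{5, Int64}]
```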


The timing profile is quite interesting:

julia> using BenchmarkTools

julia> x=Tuple(1:30);@btime f(x);
  204.906 ns (4 allocations: 320 bytes)

julia> x=Tuple(1:30);@btime g(x);
  843.543 ns (2 allocations: 592 bytes)

julia> x=Tuple(1:40);@btime f(x);
  16.220 μs (34 allocations: 7.39 KiB)

julia> x=Tuple(1:40);@btime g(x);
  1.102 μs (2 allocations: 736 bytes)

I think it’s useful to look at @code_typed for small tuples, to see that the recursive version generates the same code that you would write by hand, so it is extremely efficient.

julia> mylog(a::Tuple{}) = ()
mylog (generic function with 1 method)

julia> mylog(a::Tuple) = (log(first(a)), mylog(Base.tail(a))...)
mylog (generic function with 2 methods)

julia> @code_typed mylog((1.0, 2.0, 3.0))
CodeInfo(
1 ─ %1 = Base.getfield(a, 1, true)::Float64
│   %2 = invoke Main.log(%1::Float64)::Float64
│   %3 = (getfield)(a, 2)::Float64
│   %4 = (getfield)(a, 3)::Float64
│   %5 = invoke Main.log(%3::Float64)::Float64
│   %6 = invoke Main.log(%4::Float64)::Float64
│   %7 = Core.tuple(%2, %5, %6)::Tuple{Float64,Float64,Float64}
└──      return %7
) => Tuple{Float64,Float64,Float64}

My understanding is that this doesn’t really work for long tuples, so Base Julia “bails out” and uses a different approach for lengths over 16; see here for an example.

In practice, I guess in most cases you could use map, foldl, and filter, which are optimized for tuples of any length (i.e., they change strategy for long tuples).
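For instance, a minimal sketch of those Base functions applied to a tuple (the doubling function `a` here stands in for whatever you’d map):

```julia
a(e) = 2 * e          # hypothetical element-wise operation

x = (1, 2, 3, 4)
map(a, x)             # -> (2, 4, 6, 8); Base's map on tuples returns a tuple
foldl(+, x)           # -> 10
filter(isodd, x)      # -> (1, 3); filter on tuples also returns a tuple
```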


I tried @code_warntype for f and g in my script and saw a possible reason why the recursive version was preferred: f's return type was inferred as the concrete NTuple{1000,Int64}, but g's was inferred as the abstract Tuple{Vararg{Int64,N} where N}. I don’t actually know how much of an issue this type instability is, though.
Weirdly, changing g to g(x::NTuple{N,T}) where {N,T} = NTuple{N,T}(a(e) for e in x) didn’t change anything.
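For comparison, here’s a sketch of what I mean (gmap is a hypothetical map-based variant, not from my script): asking inference for the return type shows the map version staying concrete, unlike the generator-based version above.

```julia
a(e) = 2 * e
g(x) = Tuple(a(e) for e in x)   # generator-based version from the script
gmap(x) = map(a, x)             # hypothetical map-based alternative

# map over a tuple keeps a concrete inferred return type:
Base.return_types(gmap, (NTuple{4, Int},))  # -> [NTuple{4, Int64}]
```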