# Efficient recursive tuple construction

I’m wondering if there is an efficient way to build a tuple recursively. I’m working with a three-term recurrence relationship (specifically, this one).

The first and second terms are known. From them, all the other terms can be computed, one-by-one. The number of terms is known at run time.

I can allocate a vector and fill in the sequence, but I’m wondering if there might be tricks to avoid doing such an allocation. I tried a recursive splatting (slurping?) approach to build the tuple,

``````function Tₖ(T::Tuple, ξ, n::Int)
if length(T) == n
return T
else
return Tₖ((T..., 2*ξ*T[end] - T[end-1]), ξ, n)
end
end
``````

but this results in horrible type instability when indexing the tuple.

Any ideas would be appreciated!

Here is something that may serve as inspiration:

``````julia> function Tₖ(t::Tuple,::Val{N},::Val{M}) where {N,M}
ntuple(i -> i <= N ? t[i] : 1, M)
end
Tₖ (generic function with 1 method)

julia> t = (0,0,0);

julia> @btime Tₖ(\$t,Val(length(\$t)),Val(8))
0.021 ns (0 allocations: 0 bytes)
(0, 0, 0, 1, 1, 1, 1, 1)

``````

Note that this will be compiled once for every combination of N and M, so that will be slow to be called recursively if that recursion occurs only once in your code. If you have to repeat the recursion multiple times, that that may work (obs: the `0.021 ns` is not meaninful there, for certain, but that is allocation free and fast).

A tuple is probably the wrong data structure if the length is only known at runtime. This is what a `Vector` is for.

3 Likes

Allocation may be unavoidable, but it seemed like it might be possible to avoid it because all values of the sequence are known, in a sense. They’re all determined by a single input.

So is the halting problem.

Sure, but we know exactly how and when this recurrance will halt.

You know, but the compiler does not. And the latter point is crucial for performance.

Well, is there a way to tell the compiler? The number of terms is known at runtime so one could use it for dispatch?

This is what I was trying above. In your case, it would be:

``````julia> function Tₖ(T::Tuple, ξ, ::Val{N}, ::Val{M}) where {N,M}
N == M && return T
tup = ntuple(N+1) do i
if i <= N
return T[i]
elseif i == N + 1
return 2*ξ*T[N] - T[N-1]
end
end
Tₖ(tup, ξ, Val(N+1), Val(M))
end
Tₖ (generic function with 1 method)

julia> using BenchmarkTools

julia> Tₖ((1,2),3,Val(2),Val(9))
(1, 2, 11, 64, 373, 2174, 12671, 73852, 430441)

julia> @btime Tₖ((1,2),3,Val(2),Val(9))
77.598 ns (5 allocations: 336 bytes)
(1, 2, 11, 64, 373, 2174, 12671, 73852, 430441)

``````

It does perform slightly better than your original code, apparently. But I could not get rid completely of the allocations because the return type of the tuple is unknown, causing a type instability.

``````julia> function Tₖ(T::Tuple, ξ, n::Int)
if length(T) == n
return T
else
return Tₖ((T..., 2*ξ*T[end] - T[end-1]), ξ, n)
end
end
Tₖ (generic function with 2 methods)

julia> @btime Tₖ((1,2),3,9)
130.979 ns (7 allocations: 464 bytes)
(1, 2, 11, 64, 373, 2174, 12671, 73852, 430441)

julia> @btime Tₖ((1,2),3,9)
108.886 ns (6 allocations: 416 bytes)
(1, 2, 11, 64, 373, 2174, 12671, 73852, 430441)

``````

Yes I was trying some similar approaches but never got results faster than simply allocating and filling a vector What about (disclaimer: I think the result is wrong, but that is a detail). Instead of the recursion, repeat the calculations. Calculations are much cheaper than allocations:

``````julia> function Tₖ(T::Tuple, ξ, ::Val{M}) where {M}
tup = ntuple(M) do i
if i <= length(T)
return T[i]
else
el2 = T[end-1]
el1 = 2*ξ*T[end]
for j in 1:i-length(T)
el2 = el1
el1 = 2*ξ*el2
end
return el1
end
end
return tup
end
Tₖ (generic function with 3 methods)

julia> Tₖ((1,2),3,Val(9))
(1, 2, 72, 432, 2592, 15552, 93312, 559872, 3359232)

julia> @btime Tₖ((1,2),3,Val(9))
8.022 ns (0 allocations: 0 bytes)
(1, 2, 72, 432, 2592, 15552, 93312, 559872, 3359232)

``````

That’s an interesting idea. I tried the following implementation

``````function Tk_(ξ, L::Int)
L == 1 && return(one(ξ))
L == 2 && return(ξ)
Tₖ₋₂ = one(ξ)
Tₖ₋₁ = ξ
Tₖ = zero(ξ)
for _ = 3:L
Tₖ = 2ξ*Tₖ₋₁ - Tₖ₋₂
Tₖ₋₂ = Tₖ₋₁
Tₖ₋₁ = Tₖ
end
return(Tₖ)
end

function Tk(ξ::U, ::Val{L})::NTuple{L,U} where {L,U}
ntuple(i->Tk_(ξ,i), L)
end
``````

BUT, weirdly, there is a huge performance dropoff when the number of elements exceeds 10:

``````using BenchmarkTools

julia> @btime Tk(0.1, Val(9))
7.900 ns (0 allocations: 0 bytes)

julia> @btime Tk(0.1, Val(10))
8.709 ns (0 allocations: 0 bytes)

julia> @btime Tk(0.1, Val(11))
426.633 ns (13 allocations: 448 bytes)

julia> @btime Tk(0.1, Val(12))
463.452 ns (14 allocations: 480 bytes)
``````

Suddenly, at >11 elements, allocations and a considerable slowdown. This is totally confusing to me.

1 Like

Maybe inlining the function helps? Just a guess.

It looks like it inlines everything until then, but reverts to a normal function call once it decides things are too big?

One way around is to make a generated function:

``````julia> @generated function Tkgen(T::Tuple, ξ, ::Val{M}) where {M}
tup = ntuple(M) do i
if i <= length(T.parameters)
return :(T[\$i])
else
el2 = :(T[end-1])
el1 = :(2*ξ*T[end])
for j in 1:i-length(T.parameters)
el2 = el1
el1 = :(2*ξ*\$el2)
end
return el1
end
end
return :((\$(tup...),))
end
Tkgen (generic function with 1 method)

julia> @btime Tkgen((1,2),3,Val(9))
0.001 ns (0 allocations: 0 bytes)
(1, 2, 72, 432, 2592, 15552, 93312, 559872, 3359232)

julia> @btime Tkgen((1,2),3,Val(15))
0.001 ns (0 allocations: 0 bytes)
(1, 2, 72, 432, 2592, 15552, 93312, 559872, 3359232, 20155392, 120932352, 725594112, 4353564672, 26121388032, 156728328192)

julia> @btime map(x -> Tkgen((1,2),x,Val(9)), \$(rand(1000)));  # a more realistic test
13.041 μs (2 allocations: 70.36 KiB)

julia> @btime map(x -> Tₖ((1,2),x,Val(9)), \$(rand(1000)));  # same speed
13.042 μs (2 allocations: 70.36 KiB)

julia> @btime map(x -> Tkgen((1,2),x,Val(15)), \$(rand(1000)));
15.666 μs (2 allocations: 117.23 KiB)
``````

Does a generated function work for you if the number of terms is a little larger? Say, 100? I gave it a try and had issues with larger numbers of terms.

What sort of issues?

I didn’t try, but 100 is very long for a tuple. For most purposes you will want to switch to a vector long before that; many Base functions switch at 16 or 32.

2 Likes

Critically slow. But you’re totally right. It’s getting pretty hairy and the solution is probably just to use vectors.

If anyone’s interested, the underlying issue has to do with doing Chebyshev interpolation in 2D. I had been using pre-allocated vectors for the chebyshev expansions, but pre-allocating strictly limits the type of the interpolation coordinate. Now I’m just dispatching on the coordinate type—using the pre-allocated vectors when possible and allocating otherwise.

The relevant methods are here.

1 Like I didn’t know that was being worked on!

I’m not sure why, but I’m generally getting ~2x better speed. Happy to share the benchmarking code.  Note the issue I was asking about in this post don’t apply in these tests. The interpolation is all allocation-free. For different coordinate types the operation is forced to allocate and slows down a little.

Sure, could you file an issue?