Variable Length Tuples Without Allocation

A quick search of this forum yields a number of questions about variable-length tuples. Maybe you have a recursive function that returns a tuple whose length depends on recursion depth, or something like that.

I have bumped into this issue several times: I know that a tuple will have some reasonable number of elements, but not precisely how many. My intuition is that I should be able to tell the compiler “Please allocate N slots on the stack, even though I might not use them all, and I’ll only `getindex` the ones I’ve populated.”

I wrote a proof of concept that I think achieves this behavior (see below). However, I don’t understand benchmarks well enough to know if my limited tests are just being smartly compiled away or if I am actually achieving allocation free code.

I’d appreciate your feedback on the implementation, edge cases that break it, etc. I’m also wondering, if this does actually work, whether there is a package it might fit within? I don’t really want to maintain a package myself, but I’d be happy to contribute to someone else’s project if they’re up for it.

module UpToNTuples
using Base: issingletontype

struct UpToNTuple{N, T}
    data::NTuple{N, T}
    len::UInt8
    function UpToNTuple(::Val{N}, ::Type{T}, data::NTuple{N,T}, len::UInt8) where {N, T}
        return new{N, T}(data, len)
    end
end

## A bunch of constructors to add some flexibility
# handle empty case
UpToNTuple(; _...) = throw(ArgumentError("Must provide eltype for an empty `UpToNTuple` like `UpToNTuple{N,T}` or `UpToNTuple{T}`"))
function UpToNTuple{X}(; kwargs...) where X
    X isa Integer && throw(ArgumentError("Must provide eltype for an empty `UpToNTuple` like `UpToNTuple{N,T}` or `UpToNTuple{T}`"))
    return _UpToNTuple(Val(10), X, (); kwargs...)
end
UpToNTuple{N,T}(; kwargs...) where {N,T} = _UpToNTuple(Val(N), T, (); kwargs...)

# handle cases with elements
UpToNTuple(data::T; ks...) where {T} = _UpToNTuple(Val(10), eltype(T), data; ks...)
function UpToNTuple{X}(data::T; ks...) where {X, T}
    Eltype = X isa Integer ? eltype(T) : X
    N = X isa Integer ? X : 10
    _UpToNTuple(Val(N), Eltype, data; ks...)
end
UpToNTuple{N,T}(data; ks...) where {N,T} = _UpToNTuple(Val(N), T, data; ks...)

# special case handling for getting default value for empty tuples
function _UpToNTuple(::Val{N}, ::Type{E}, data::T; kwargs...) where {T, E, N}
    len = UInt8(length(data))
    N > 10 && throw(ArgumentError("Can't construct an UpToNTuple with more than 10 slots"))
    len > N && throw(ArgumentError("Too many arguments for UpToNTuple{$N}"))

    default_value = if haskey(kwargs, :default_value)
        kwargs[:default_value]
    elseif iszero(len)
        empty_value(E)
    else
        empty_value(first(data))
    end::E

    d = ntuple(i -> (i <= len ? data[i] : default_value)::E, N)
    return UpToNTuple(Val(N), E, d, len)
end

#### constructor helpers
"""
    empty_value(x::T)
Return a default empty value for type `T`.
"""
empty_value(x::T) where T = isbitstype(T) ? x : empty_value(T)
empty_value(::Type{Symbol}) = Symbol("")
empty_value(::Type{T}) where {T<:AbstractString} = T("")
empty_value(::Type{T}) where {T<:Number} = zero(T)
empty_value(::Type{T}) where T = issingletontype(T) ? T() : error("No default empty value for type $T")

# Basic Tuple API
Base.length(u::UpToNTuple) = Int(u.len)
Base.eachindex(u::UpToNTuple) = Base.OneTo(length(u))
function Base.getindex(u::UpToNTuple{<:Any, T}, i::Int) where T
    i > length(u) && throw(BoundsError(u, i))
    return u.data[i]::T
end
function Base.iterate(u::UpToNTuple, i::Int=1)
    i > length(u) && return nothing
    return (u.data[i], i + 1)
end
function push!!(n::UpToNTuple{N, T}, x::T2) where {N, T, T2 <: T}
    i = length(n) + 1
    i > N && throw(ArgumentError("Too many elements for UpToNTuple{$N}"))
    data = ntuple(N) do j
        (j == i ? x : n.data[j])::T
    end
    return UpToNTuple(Val(N), T, data, UInt8(i))::UpToNTuple{N, T}
end
function Base.show(io::IO, x::UpToNTuple{N,T}) where {N, T}
    print(io, "UpToNTuple{", N, ", ", T, "}(")
    for i in 1:length(x)
        show(io, x[i])
        i < length(x) && print(io, ", ")
    end
    print(io, ")")
end
end
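Standalone, the core trick the module relies on — padding a fixed-size `NTuple` with a default value and carrying the logical length alongside — can be sketched in a few lines (the `capped` helper below is hypothetical, just for illustration; it is not part of the module):

```julia
# Build a length-N tuple, padding unused slots with a default value,
# and return the logical length alongside it -- the same idea the
# module above wraps in a struct.
function capped(data, ::Val{N}, default) where {N}
    len = length(data)
    t = ntuple(i -> i <= len ? data[i] : default, Val(N))
    return t, len
end

t, len = capped((1, 2, 3), Val(10), 0)
@assert t == (1, 2, 3, 0, 0, 0, 0, 0, 0, 0)
@assert len == 3
```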

Benchmarks

using Chairmarks

f(a, b, c) = UpToNTuples.UpToNTuple((a, b, c))[1];
# I don't see any allocations in the output
@code_native f(1,2,3)

function rec(::Val{N}) where N
    numbers = UpToNTuples.UpToNTuple{N, Int}(N)
    return rec(N-1, numbers)::UpToNTuples.UpToNTuple{N, Int}
end
function rec(n, numbers)
    n == 0 && return numbers
    numbers = UpToNTuples.push!!(numbers, n)
    return rand((true, false)) ? rec(n-1, numbers) : numbers
end

f() = map(_-> rec(Val(10)), 1:1000)

@b f()
# 19.500 μs (3 allocs: 86.000 KiB)

Thanks to @jakobnissen , @Mason , and @gbaraldi for your help on the slack thread.

Edit: I learned that Tuples allocate after 10 elements, so I set an upper limit to 10. I also improved the benchmark to a slightly more realistic case.

Edit2: Change to API – remove splatting of elements in favor of accepting a single iterator in the constructor. This follows the NTuple API more closely.

This seems like a good use for a generated function, but only if your input size is known at compile time.

@generated function recg(::Val{N}) where {N}
    return quote
        if N == 0
            return ()
        end
        return rand((true, false)) ? (N, recg(Val(N - 1))...) : ()
    end
end

In practice you probably won’t have something like that rand call, in which case it can be optimized very well (the output below is without the rand call):

julia_recg_5351:                        # @julia_recg_5351
; Function Signature: recg(Base.Val{3})
; ┌ @ REPL[42]:1 within `recg`
# %bb.0:                                # %top
        push    rbp
        mov     rbp, rsp
        mov     rax, rdi
        movabs  rcx, offset ".L_j_const#1"
        vmovups xmm0, xmmword ptr [rcx]
        vmovups xmmword ptr [rdi], xmm0
        mov     qword ptr [rdi + 16], 1
        pop     rbp
        ret
.Lfunc_end0:
        .size   julia_recg_5351, .Lfunc_end0-julia_recg_5351
; └
                                        # -- End function
        .type   ".L+Core.Tuple#5353",@object    # @"+Core.Tuple#5353"
        .section        .rodata,"a",@progbits
        .p2align        3, 0x0
".L+Core.Tuple#5353":
        .quad   ".L+Core.Tuple#5353.jit"
        .size   ".L+Core.Tuple#5353", 8

        .type   ".L_j_const#1",@object          # @"_j_const#1"
        .p2align        3, 0x0
".L_j_const#1":
        .quad   3                               # 0x3
        .quad   2                               # 0x2
        .quad   1                               # 0x1
        .size   ".L_j_const#1", 24

This looks very much like SmallVector{N,Int} from my package SmallCollections.jl, see here. On master there is also a MutableSmallVector, see here.

The API is slightly different:

julia> v = SmallVector{5,Int}((5,)); push(v, 4)
2-element SmallVector{5, Int64}:
 5
 4

I saw your package, @matthias314 , was inspired by it, promptly forgot, and subconsciously plagiarized you!

Thanks for the reminder.

Do you mean somewhere in your code? AFAIK Tuple construction doesn’t do that in general:

julia> isbits(@btime SVector{1600, Int}($(1:1600)))
  198.542 ns (0 allocations: 0 bytes)
true

julia> dump(SVector{1600, Int})
SVector{1600, Int64} <: StaticArray{Tuple{1600}, Int64, 1}
  data::NTuple{1600, Int64}

julia> sizeof(zeros(SVector{1600, Int},2)) # inlined, 1600*8*2 == 25600
25600

The standard ntuple (without Val argument) allocates for longer tuples:

julia> f(::Val{N}) where N = ntuple(k -> 2*k, N);

julia> @b f(Val(10))
1.319 ns

julia> @b f(Val(11))
455.254 ns (3 allocs: 240 bytes)

I think the threshold of length 10 is due to how ntuple is defined:

@inline function ntuple(f::F, n::Integer) where F
    # marked inline since this benefits from constant propagation of `n`
    t = n == 0  ? () :
        n == 1  ? (f(1),) :
        n == 2  ? (f(1), f(2)) :
        n == 3  ? (f(1), f(2), f(3)) :
        n == 4  ? (f(1), f(2), f(3), f(4)) :
        n == 5  ? (f(1), f(2), f(3), f(4), f(5)) :
        n == 6  ? (f(1), f(2), f(3), f(4), f(5), f(6)) :
        n == 7  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7)) :
        n == 8  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8)) :
        n == 9  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8), f(9)) :
        n == 10 ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8), f(9), f(10)) :
        _ntuple(f, n)
    return t
end

after length 10, ntuple calls _ntuple.
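A quick self-contained check confirms the cutoff, and that the `Val` method (where the length is a compile-time constant) produces the same tuple past it:

```julia
# Past length 10 the Integer method falls through to _ntuple;
# the Val method instead fixes the length at compile time and
# does not need the fallback.
t_int = ntuple(i -> 2i, 11)       # goes through _ntuple
t_val = ntuple(i -> 2i, Val(11))  # length is a compile-time constant
@assert t_int == t_val == (2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22)
```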

Good catch!
Directly calling Base._ntuple also allocates:

julia> @allocated Base._ntuple(identity, 3)
112

Base._ntuple uses an array comprehension just to immediately splat it again:

function _ntuple(f::F, n::Int) where F
    @noinline
    (n >= 0) || throw(ArgumentError(LazyString("tuple length should be ≥ 0, got ", n)))
    ([f(i) for i = 1:n]...,)
end

When using a generator instead, no allocations happen:

julia> function _ntuple(f::F, n) where F
           @noinline
           (n >= 0) || throw(ArgumentError(LazyString("tuple length should be ≥ 0, got ", n)))
           ((f(i) for i = 1:n)...,)
       end

julia> @allocated _ntuple(identity, 30)
0
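The difference can also be reproduced in isolation, outside of `_ntuple` (the names `via_comprehension` and `via_generator` are made up for this demo):

```julia
# Splatting a comprehension first materializes a Vector on the heap;
# splatting a generator can skip the intermediate array.
via_comprehension(n) = ([2i for i in 1:n]...,)
via_generator(n)     = ((2i for i in 1:n)...,)

@assert via_comprehension(5) == via_generator(5) == (2, 4, 6, 8, 10)
```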

My first impulse was to immediately create a PR, but it seemed extremely unlikely that one of the core data structures would miss such a low-hanging optimization, especially after checking who last touched this line.

So can someone please explain why we need the comprehension here, when the generator would end up with seemingly the same result without allocations?

Or is this just an artefact from an earlier era of Julia that no one has revisited since?

It’s possible the change you propose would be an improvement. It’s also possible for it to be rejected because of obscure reasons. Keep in mind, though, I think people usually use the Val method when n is large, ntuple(f, Val(n)) instead of ntuple(f, n). At least that’s the established practice.

Interestingly, with Chairmarks.jl the story seems to get a twist:

While the version with the generator is faster for constant input, it might actually be slower for non-constant input and allocate more (pay attention to the units, to the fully qualified names, and to the interpolation):

julia> @b _ntuple(identity, 30)
1.251 ns

julia> @b Base._ntuple(identity, 30)
510.395 ns (3 allocs: 560 bytes)

julia> @b _ntuple(identity, $30)
1.035 μs (33 allocs: 1.609 KiB)

julia> @b Base._ntuple(identity, $30)
507.882 ns (3 allocs: 560 bytes)

This fits the explanation that anyone with a constant length should use the Val method.