Variable Length Tuples Without Allocation

A quick search of this forum yields a number of questions about variable-length tuples. Maybe you have a recursive function that returns a tuple whose length depends on recursion depth, or something like that.

I have bumped into this issue several times: I know that a tuple will have some reasonable number of elements, but not precisely how many. My intuition is that I should be able to tell the compiler “Please allocate N slots on the stack, even though I might not use them all, and I’ll only `getindex` the ones I’ve populated.”

I wrote a proof of concept that I think achieves this behavior (see below). However, I don’t understand benchmarks well enough to know if my limited tests are just being smartly compiled away or if I am actually achieving allocation free code.

I’d appreciate your feedback on the implementation, edge cases that break it, etc. I’m also wondering, if this does actually work, whether there is a package it might fit within? I don’t really want to maintain a package myself, but I’d be happy to contribute to someone else’s project if they’re up for it.

module UpToNTuples
using Base: issingletontype

struct UpToNTuple{N, T}
    data::NTuple{N, T}
    len::UInt8
    function UpToNTuple(::Val{N}, ::Type{T}, data::NTuple{N,T}, len::UInt8) where {N, T}
        return new{N, T}(data, len)
    end
end

## A bunch of constructors to add some flexibility
# handle empty case
UpToNTuple(; _...) = throw(ArgumentError("Must provide eltype for an empty `UpToNTuple` like `UpToNTuple{N,T}` or `UpToNTuple{T}`"))
function UpToNTuple{X}(; kwargs...) where X
    X isa Integer && throw(ArgumentError("Must provide eltype for an empty `UpToNTuple` like `UpToNTuple{N,T}` or `UpToNTuple{T}`"))
    return _UpToNTuple(Val(10), X, (); kwargs...)
end
UpToNTuple{N,T}(; kwargs...) where {N,T} = _UpToNTuple(Val(N), T, (); kwargs...)

# handle cases with elements
UpToNTuple(data::T; ks...) where {T} = _UpToNTuple(Val(10), eltype(T), data; ks...)
function UpToNTuple{X}(data::T; ks...) where {X, T}
    Eltype = X isa Integer ? eltype(T) : X
    N = X isa Integer ? X : 10
    _UpToNTuple(Val(N), Eltype, data; ks...)
end
UpToNTuple{N,T}(data; ks...) where {N,T} = _UpToNTuple(Val(N), T, data; ks...)

# special case handling for getting default value for empty tuples
function _UpToNTuple(::Val{N}, ::Type{E}, data::T; kwargs...) where {T, E, N}
    len = UInt8(length(data))
    N > 10 && throw(ArgumentError("Can't construct an UpToNTuple with more than 10 slots"))
    len > N && throw(ArgumentError("Too many arguments for UpToNTuple{$N}"))

    default_value = if haskey(kwargs, :default_value)
        kwargs[:default_value]
    elseif iszero(len)
        empty_value(E)
    else
        empty_value(first(data))
    end::E

    d = ntuple(i -> (i <= len ? data[i] : default_value)::E, N)
    return UpToNTuple(Val(N), E, d, len)
end

#### constructor helpers
"""
    empty_value(x::T)
Return a default empty value for type `T`.
"""
empty_value(x::T) where T = isbitstype(T) ? x : empty_value(T)
empty_value(::Type{Symbol}) = Symbol("")
empty_value(::Type{T}) where {T<:AbstractString} = T("")
empty_value(::Type{T}) where {T<:Number} = zero(T)
empty_value(::Type{T}) where T = issingletontype(T) ? T() : error("No default empty value for type $T")

# Basic Tuple API
Base.length(u::UpToNTuple) = Int(u.len)
Base.eachindex(u::UpToNTuple) = Base.OneTo(length(u))
function Base.getindex(u::UpToNTuple{<:Any, T}, i::Int) where T
    i > length(u) && throw(BoundsError(u, i))
    return u.data[i]::T
end
function Base.iterate(u::UpToNTuple, i::Int=1)
    i > length(u) && return nothing
    return (u.data[i], i + 1)
end
function push!!(n::UpToNTuple{N, T}, x::T2) where {N, T, T2 <: T}
    i = length(n) + 1
    i > N && throw(ArgumentError("Too many elements for UpToNTuple{$N}"))
    data = ntuple(N) do j
        (j == i ? x : n.data[j])::T
    end
    return UpToNTuple(Val(N), T, data, UInt8(i))::UpToNTuple{N, T}
end
function Base.show(io::IO, x::UpToNTuple{N,T}) where {N, T}
    print(io, "UpToNTuple{", N, ", ", T, "}(")
    for i in 1:length(x)
        show(io, x[i])
        i < length(x) && print(io, ", ")
    end
    print(io, ")")
end
end
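Standalone, the core trick the module relies on — padding a fixed-size `NTuple` with a default value and carrying the logical length alongside — can be sketched in a few lines (the `capped` helper below is hypothetical, just for illustration; it is not part of the module):

```julia
# Build a length-N tuple, padding unused slots with a default value,
# and return the logical length alongside it -- the same idea the
# module above wraps in a struct.
function capped(data, ::Val{N}, default) where {N}
    len = length(data)
    t = ntuple(i -> i <= len ? data[i] : default, Val(N))
    return t, len
end

t, len = capped((1, 2, 3), Val(10), 0)
@assert t == (1, 2, 3, 0, 0, 0, 0, 0, 0, 0)
@assert len == 3
```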

Benchmarks

using Chairmarks

f(a, b, c) = UpToNTuples.UpToNTuple((a, b, c))[1];
# I don't see any allocations in the output
@code_native f(1,2,3)

function rec(::Val{N}) where N
    numbers = UpToNTuples.UpToNTuple{N, Int}(N)
    return rec(N-1, numbers)::UpToNTuples.UpToNTuple{N, Int}
end
function rec(n, numbers)
    n == 0 && return numbers
    numbers = UpToNTuples.push!!(numbers, n)
    return rand((true, false)) ? rec(n-1, numbers) : numbers
end

f() = map(_-> rec(Val(10)), 1:1000)

@b f()
# 19.500 μs (3 allocs: 86.000 KiB)

Thanks to @jakobnissen , @Mason , and @gbaraldi for your help on the slack thread.

Edit: I learned that Tuples allocate after 10 elements, so I set an upper limit to 10. I also improved the benchmark to a slightly more realistic case.

Edit2: Change to API – remove splatting of elements in favor of accepting a single iterator in the constructor. This follows the NTuple API more closely.

This seems like a good use for a generated function, but only if your input size is known at compile time.

@generated function recg(::Val{N}) where {N}
    return quote
        if N == 0
            return ()
        end
        return rand((true, false)) ? (N, recg(Val(N - 1))...) : ()
    end
end

In practice you probably won’t have something like that rand call, in which case it can be optimized very well (the output below is without the rand call):

julia_recg_5351:                        # @julia_recg_5351
; Function Signature: recg(Base.Val{3})
; ┌ @ REPL[42]:1 within `recg`
# %bb.0:                                # %top
        push    rbp
        mov     rbp, rsp
        mov     rax, rdi
        movabs  rcx, offset ".L_j_const#1"
        vmovups xmm0, xmmword ptr [rcx]
        vmovups xmmword ptr [rdi], xmm0
        mov     qword ptr [rdi + 16], 1
        pop     rbp
        ret
.Lfunc_end0:
        .size   julia_recg_5351, .Lfunc_end0-julia_recg_5351
; └
                                        # -- End function
        .type   ".L+Core.Tuple#5353",@object    # @"+Core.Tuple#5353"
        .section        .rodata,"a",@progbits
        .p2align        3, 0x0
".L+Core.Tuple#5353":
        .quad   ".L+Core.Tuple#5353.jit"
        .size   ".L+Core.Tuple#5353", 8

        .type   ".L_j_const#1",@object          # @"_j_const#1"
        .p2align        3, 0x0
".L_j_const#1":
        .quad   3                               # 0x3
        .quad   2                               # 0x2
        .quad   1                               # 0x1
        .size   ".L_j_const#1", 24

This looks very much like SmallVector{N,Int} from my package SmallCollections.jl, see here. On master there is also a MutableSmallVector, see here.

The API is slightly different:

julia> v = SmallVector{5,Int}((5,)); push(v, 4)
2-element SmallVector{5, Int64}:
 5
 4

I saw your package, @matthias314 , was inspired by it, promptly forgot, and subconsciously plagiarized you!

Thanks for the reminder.

Do you mean somewhere in your code? AFAIK Tuple construction doesn’t do that in general:

julia> isbits(@btime SVector{1600, Int}($(1:1600)))
  198.542 ns (0 allocations: 0 bytes)
true

julia> dump(SVector{1600, Int})
SVector{1600, Int64} <: StaticArray{Tuple{1600}, Int64, 1}
  data::NTuple{1600, Int64}

julia> sizeof(zeros(SVector{1600, Int},2)) # inlined, 1600*8*2 == 25600
25600

The standard ntuple (without Val argument) allocates for longer tuples:

julia> f(::Val{N}) where N = ntuple(k -> 2*k, N);

julia> @b f(Val(10))
1.319 ns

julia> @b f(Val(11))
455.254 ns (3 allocs: 240 bytes)

I think the threshold of length 10 is due to how ntuple is defined:

@inline function ntuple(f::F, n::Integer) where F
    # marked inline since this benefits from constant propagation of `n`
    t = n == 0  ? () :
        n == 1  ? (f(1),) :
        n == 2  ? (f(1), f(2)) :
        n == 3  ? (f(1), f(2), f(3)) :
        n == 4  ? (f(1), f(2), f(3), f(4)) :
        n == 5  ? (f(1), f(2), f(3), f(4), f(5)) :
        n == 6  ? (f(1), f(2), f(3), f(4), f(5), f(6)) :
        n == 7  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7)) :
        n == 8  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8)) :
        n == 9  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8), f(9)) :
        n == 10 ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8), f(9), f(10)) :
        _ntuple(f, n)
    return t
end

after length 10, ntuple calls _ntuple.
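A quick self-contained check confirms the cutoff, and that the `Val` method (where the length is a compile-time constant) produces the same tuple past it:

```julia
# Past length 10 the Integer method falls through to _ntuple;
# the Val method instead fixes the length at compile time and
# does not need the fallback.
t_int = ntuple(i -> 2i, 11)       # goes through _ntuple
t_val = ntuple(i -> 2i, Val(11))  # length is a compile-time constant
@assert t_int == t_val == (2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22)
```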

Good catch!
Directly calling Base._ntuple also allocates:

julia> @allocated Base._ntuple(identity, 3)
112

Base._ntuple uses an array comprehension just to immediately splat it again:

function _ntuple(f::F, n::Int) where F
    @noinline
    (n >= 0) || throw(ArgumentError(LazyString("tuple length should be ≥ 0, got ", n)))
    ([f(i) for i = 1:n]...,)
end

When using a generator instead, no allocations happen:

julia> function _ntuple(f::F, n) where F
           @noinline
           (n >= 0) || throw(ArgumentError(LazyString("tuple length should be ≥ 0, got ", n)))
           ((f(i) for i = 1:n)...,)
       end

julia> @allocated _ntuple(identity, 30)
0
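The difference can also be reproduced in isolation, outside of `_ntuple` (the names `via_comprehension` and `via_generator` are made up for this demo):

```julia
# Splatting a comprehension first materializes a Vector on the heap;
# splatting a generator can skip the intermediate array.
via_comprehension(n) = ([2i for i in 1:n]...,)
via_generator(n)     = ((2i for i in 1:n)...,)

@assert via_comprehension(5) == via_generator(5) == (2, 4, 6, 8, 10)
```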

My first impulse was to immediately create a PR, but it seemed extremely unlikely that one of the core data structures would miss such a low-hanging optimization, especially after checking who last touched this line.

So can someone please explain why we need the comprehension here, when the generator would end up with seemingly the same result without allocations?

Or is this just an artefact from an earlier era of Julia that no one has revisited since?

It’s possible the change you propose would be an improvement. It’s also possible for it to be rejected because of obscure reasons. Keep in mind, though, I think people usually use the Val method when n is large, ntuple(f, Val(n)) instead of ntuple(f, n). At least that’s the established practice.

Interestingly, with Chairmarks.jl the story seems to get a twist:

While the version with the generator is faster for constant input, it might actually be slower for non-constant input and allocate more (pay attention to the units, to the fully qualified names, and to the interpolation):

julia> @b _ntuple(identity, 30)
1.251 ns

julia> @b Base._ntuple(identity, 30)
510.395 ns (3 allocs: 560 bytes)

julia> @b _ntuple(identity, $30)
1.035 μs (33 allocs: 1.609 KiB)

julia> @b Base._ntuple(identity, $30)
507.882 ns (3 allocs: 560 bytes)

This fits the explanation that anyone with a constant length should use the Val method.