Allocations from selecting fields of a named tuple

There seems to be a large difference in the number of allocations when selecting fields from a named tuple via two different approaches.

using BenchmarkTools
nt = (a=1, b=2, c=3)
t = (:b, :c)

If I use getindex, I get 6 allocations:

julia> @btime $nt[$t];
  478.969 ns (6 allocations: 224 bytes)

But if I use one of the NamedTuple constructors, I get zero allocations:

julia> @btime NamedTuple{(:b, :c)}($nt);
  2.551 ns (0 allocations: 0 bytes)

Does anyone know why there is a difference in the number of allocations between these two approaches?

(I also tried wrapping the nt[t] in a function, but it didn’t seem to make a difference.)

Version Info
julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 4 × Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores

Seems to be due to constant propagation (a @code_warntype check is sketched after the list):

  1. @less nt[t] shows that getindex is implemented as
    @inline getindex(t::NamedTuple, idxs::Tuple{Vararg{Symbol}}) = NamedTuple{idxs}(t)
    
  2. nt[(:b, :c)] (a literal tuple instead of nt[t]) does not allocate:
    julia> @btime $nt[(:b, :c)];
       4.207 ns (0 allocations: 0 bytes)
    
    whereas the following, where t is not a constant, does allocate:
    julia> @btime NamedTuple{$t}($nt);
       775.243 ns (6 allocations: 224 bytes)
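
A quick way to see the instability (using a hypothetical helper, select, just to force the names through a function boundary so they are not constants):

using InteractiveUtils  # provides @code_warntype outside the REPL
select(nt, names) = nt[names]  # names arrives as a plain runtime value here
@code_warntype select((a=1, b=2, c=3), (:b, :c))

The return type should be reported as an abstract NamedTuple, since inference only sees the type Tuple{Symbol, Symbol} for names, not the symbol values themselves.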
    

If you wrap it all in a function, it doesn’t show any allocations either:

julia> function foo()
           nt = (a=1, b=2, c=3)
           t = (:b, :c)
           nt[t]
       end
foo (generic function with 1 method)

julia> @btime foo()
  0.791 ns (0 allocations: 0 bytes)
(b = 2, c = 3)

<1 ns timings suggest the code was either run at compile time or hoisted out of the benchmark loop. It’s the former:

julia> @code_llvm foo()
;  @ REPL[14]:1 within `foo`
define void @julia_foo_896([2 x i64]* noalias nocapture sret([2 x i64]) %0) #0 {
top:
;  @ REPL[14]:4 within `foo`
  %1 = bitcast [2 x i64]* %0 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(16) %1, i8* noundef nonnull align 8 dereferenceable(16) bitcast ([2 x i64]* @_j_const1 to i8*), i64 16, i1 false)
  ret void
}

julia> @eval bar() = $(foo())
bar (generic function with 1 method)

julia> @code_lowered bar()
CodeInfo(
1 ─     return (b = 2, c = 3)
)

julia> @code_llvm bar()
;  @ REPL[20]:1 within `bar`
define void @julia_bar_920([2 x i64]* noalias nocapture sret([2 x i64]) %0) #0 {
top:
  %1 = bitcast [2 x i64]* %0 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(16) %1, i8* noundef nonnull align 8 dereferenceable(16) bitcast ([2 x i64]* @_j_const1 to i8*), i64 16, i1 false)
  ret void
}

As @bertschi said, the difference is whether constant folding happens or not.
The particular problem is that unless the value of t is constant-folded, the return type of the expression cannot be known:
if t is (:a,) the result is a @NamedTuple{a::Int64}, whereas if it is (:a, :b) it is a @NamedTuple{a::Int64, b::Int64}, and so on.
That makes the expression type-unstable, and the compiler needs to heap-allocate this object of unknown size.
Whereas if we do know the value, it can be allocated on the stack (which doesn’t show up as an allocation at all).
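
If the names genuinely aren’t compile-time constants at the call site, one common workaround (a sketch, not something from this thread) is to lift them into the type domain with Val, so the return type becomes inferable again:

# Hypothetical helper: the names travel in the *type* Val{(:b, :c)},
# so inference can compute the concrete NamedTuple return type.
select(nt::NamedTuple, ::Val{names}) where {names} = NamedTuple{names}(nt)

nt = (a=1, b=2, c=3)
select(nt, Val((:b, :c)))  # (b = 2, c = 3)

Note that constructing the Val from a runtime value still goes through dynamic dispatch, so this only pays off when Val((:b, :c)) is itself a constant the compiler can propagate.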

This constant folding can be seen in @Albert_de_montserrat's answer, because constant folding is only performed in functions, since the optimizer only runs on functions.
At that point the compiler also has a constant for the value of nt, so it can constant-fold all the way to an answer, as @Benny said.

julia> function foo()
                 nt = (a=1, b=2, c=3)
                 t = (:b, :c)
                 nt[t]
             end
foo (generic function with 1 method)

julia> @code_llvm foo()
; Function Signature: foo()
;  @ REPL[1]:1 within `foo`
define void @julia_foo_1754([2 x i64]* noalias nocapture noundef nonnull sret([2 x i64]) align 8 dereferenceable(16) %sret_return) #0 {
top:
  %0 = bitcast [2 x i64]* %sret_return to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(16) %0, i8* noundef nonnull align 8 dereferenceable(16) bitcast ([2 x i64]* @"_j_const#1" to i8*), i64 16, i1 false)
  ret void
}
julia> using BenchmarkTools
Precompiling BenchmarkTools
  5 dependencies successfully precompiled in 11 seconds. 32 already precompiled.

julia> @btime foo()
  0.855 ns (0 allocations: 0 bytes)
(b = 2, c = 3)

If we instead pass nt in as an interpolated value, we get a more representative timing:

julia> function bar(nt)
                 t = (:b, :c)
                 nt[t]
             end
bar (generic function with 1 method)

julia> @btime bar($(;a=1,b=2,c=3))
  1.498 ns (0 allocations: 0 bytes)
(b = 2, c = 3)
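
Conversely (using a hypothetical baz for contrast), if the names are also passed in as a runtime argument, the constants are gone again and the allocations come back:

using BenchmarkTools

baz(nt, t) = nt[t]  # now the names are a runtime value inside baz as well
@btime baz($(; a=1, b=2, c=3), $((:b, :c)))

This should report allocations comparable to the original $nt[$t] benchmark, since inference again only sees Tuple{Symbol, Symbol} for t.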