Why do large NTuples Allocate?

I was surprised to find that generating Tuples via generators causes allocations for large generators…

Using BenchmarkTools 

@btime NTuple{5,Float64}(1.0 for i=1:5)
  1.500 ns (0 allocations: 0 bytes)

@btime NTuple{50,Float64}(1.0 for i=1:50)
  1.633 μs (52 allocations: 1.70 KiB)

 @btime NTuple{500,Float64}(1.0 for i=1:500)
  15.500 μs (502 allocations: 15.88 KiB)

I would’ve thought each of the cases would not heap-allocate as the compiler knows both the length of the Tuple as well as the type of its elements. I am hoping to use this for a real-time processing application and ideally I would like 0 heap-allocations during processing.

I am using Julia 1.6.3 on an M1-MBP.

Thanks!

I’m not sure if I can answer the “why”, but for large collections as those the best option is to use preallocated vectors.

There is work in progress to allow for that pattern to be used more often: https://github.com/JuliaLang/julia/pull/41777

Thank you for your reply! Yes, currently, I have been using StaticArrays (specifically SVectors), but the documentation mentions that they are not useful for vectors of length > 100. This is quite a restriction for my application.

Indeed, static arrays are wrappers around tuples, so the problem is the same. The standard way to deal with large collections is preallocation.

For large arrays, you are better off with Array anyway. (The whole point of StaticArray is to inline and unroll code for small vectors — this is counterproductive if the vectors are not small.)

1 Like

I see, though isn’t Array heap-allocated? I am/was under the impression it can never be pre-allocated on the stack. For real-time purposes, heap-allocation isn’t ideal.

Yes, array is heap-allocated, but the problem is having arrays being heap allocated and garbage collected all the time in functions critical for performance, and loops. If you preallocate the arrays and just use them in the critical parts of the code, that is fine.

Playing a little. In the example below, the variable under study is the aux array, which in one case will be a static array, and in the other just a standard array which is preallocated.

Using a static array, using a trick to allow the compiler to optimize for the size of the array to be generated, which is a variable of the function:

julia> using StaticArrays
       function g(x,::Val{N}) where N
           aux = zeros(SVector{N,Float64})
           for i in 1:length(x)-N # critical loop
               aux += @view(x[i:i+N-1])
           end
           return sum(aux)
       end
g (generic function with 3 methods)

julia> @btime g($x,Val(10));
  25.066 μs (0 allocations: 0 bytes)

julia> @btime g($x,Val(100));
  74.327 μs (0 allocations: 0 bytes)

julia> @btime g($x,Val(1000));
  848.970 μs (0 allocations: 0 bytes)

julia> @btime g($x,Val(5000));
  5.077 ms (0 allocations: 0 bytes)

So far so good, no allocations happening anywhere, but you will note the huge compiler time taken for the largest values of N.

Now let us simplify things and just us an allocated aux:

julia> function f(x,n)
           aux = zeros(n)
           for i in 1:length(x)-n # critical loop
               aux .+= @view(x[i:i+n-1])
           end
           return sum(aux)
       end
f (generic function with 2 methods)

julia> @btime f($x,10);
  87.694 μs (1 allocation: 144 bytes)

julia> @btime f($x,100);
  137.613 μs (1 allocation: 896 bytes)

julia> @btime f($x,1000);
  956.585 μs (1 allocation: 7.94 KiB)

julia> @btime f($x,5000);
  3.895 ms (2 allocations: 39.11 KiB)

There are allocations now (of course, because aux is being allocated). There is an important performance difference for small sizes, but for larger sizes the performance of the heap allocated array can be better (in this example the tradeoff is above 1000, but that is not typical I think). There is no compilation overhead in this second case either.

2 Likes