Why does a vector with 10 times more elements take 2x-5x less time to pre-allocate?

I’m benchmarking pre-allocation of vectors:

display(@benchmark  Vector{Int}(undef, 1000) )
display(@benchmark  Vector{Int}(undef, 10000) )
display(@benchmark  Vector{Int}(undef, 1000) )
display(@benchmark  Vector{Int}(undef, 10000) )

I get these results on my 2021 MacBook Pro (Ventura 13.6.2) with Julia 1.11.1:


This is counterintuitive: pre-allocating the longer vector takes about 2x and 5x less time (mean and median, respectively).

Why would this happen? And how can this be used for faster code?
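For completeness, here is a variant of the benchmark above (just a sketch): it interpolates the size into the expression and uses the `gcsample=true` option of BenchmarkTools so a GC pass runs before every sample, in case GC pauses triggered by earlier samples are skewing the numbers.

```julia
using BenchmarkTools

# Same comparison, but with the size interpolated and a GC pass forced
# before each sample (gcsample is a standard BenchmarkTools option).
for n in (1_000, 10_000)
    t = @benchmark Vector{Int}(undef, $n) gcsample = true
    println(n, " elements: median ", median(t).time, " ns")
end
```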

I cannot replicate this (on Windows 10), so it’s certainly not a universal phenomenon.

Versioninfo (1.10.4)

Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core™ i7-7700K CPU @ 4.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_NUM_THREADS = auto

1.10.4
julia> @benchmark Vector{Int}(undef, 1000)
BenchmarkTools.Trial: 10000 samples with 972 evaluations.
 Range (min … max):  122.634 ns …  38.501 μs  ┊ GC (min … max):  0.00% … 97.11%
 Time  (median):     175.823 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   469.906 ns ± 736.463 ns  ┊ GC (mean ± σ):  36.84% ± 27.70%

  █▅▃        ▁▃▃▃▃▂▃▃▂▁▁                                        ▁
  ████▆▆▃▃▃▁▅████████████▇▆▆▆▆▆▄▄▅▄▃▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▄▄▄▄▅▅▄▄▄▇ █
  123 ns        Histogram: log(frequency) by time       3.93 μs <

 Memory estimate: 7.94 KiB, allocs estimate: 1.

julia> @benchmark Vector{Int}(undef, 10000)
BenchmarkTools.Trial: 10000 samples with 196 evaluations.
 Range (min … max):  611.735 ns … 13.318 μs  ┊ GC (min … max):  0.00% … 88.09%
 Time  (median):     878.571 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.689 μs ±  1.234 μs  ┊ GC (mean ± σ):  40.63% ± 29.03%

  ▅▆▇█▆▂▅▂▂▁           ▁        ▂▃▂▂▂▂▃▅▆▆▅▄▃▂▁                ▂
  ███████████▆▆▅▃▁▃▁▁▅▅██▇▅▆▆▅▆███████████████████▇▇▆▇▇▆▆▆▅▇▅▅ █
  612 ns        Histogram: log(frequency) by time      4.57 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.
1.11.1
julia> @benchmark Vector{Int}(undef, 1000)
BenchmarkTools.Trial: 10000 samples with 964 evaluations.
 Range (min … max):  136.826 ns …   6.270 μs  ┊ GC (min … max):  0.00% … 83.47%
 Time  (median):     164.627 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   438.752 ns ± 484.669 ns  ┊ GC (mean ± σ):  43.07% ± 33.43%

  █▃▂▁      ▃▅▄▂▁▁▁            ▁▁▁                              ▁
  █████▆▅▅▆▇██████████▇▇▇▇▆▆▇▇▇█████▇▇▆▅▆▆▅▆▆▅▄▅▅▅▅▄▄▆▅▆▆▆▅▅▅▅▅▄ █
  137 ns        Histogram: log(frequency) by time       2.45 μs <

 Memory estimate: 7.88 KiB, allocs estimate: 3.

julia> @benchmark Vector{Int}(undef, 10000)
BenchmarkTools.Trial: 10000 samples with 178 evaluations.
 Range (min … max):  783.146 ns … 13.634 μs  ┊ GC (min … max):  0.00% … 74.92%
 Time  (median):       1.163 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):     2.383 μs ±  1.610 μs  ┊ GC (mean ± σ):  54.82% ± 36.67%

  ▅▇▇█▆▂▁                       ▂▆▆▆▆▅▄▃▂▁▁▁  ▁                ▂
  █████████▇▄▁▅▅▄▄██▆▆▅▄▄▁▁▁▄▁▄▆██████████████████▇▇▆▇█▇▆▆▆▆▆▅▅ █
  783 ns        Histogram: log(frequency) by time      6.38 μs <

 Memory estimate: 78.19 KiB, allocs estimate: 3.

I suppose you could just allocate too much and then take a view.

x = Vector{Int}(undef, alloc_length)  # e.g. 10_000
x = view(x, 1:desired_length)  # e.g. 1_000

This has to do with the speed of Julia's own allocator vs. your system's malloc. Note that this benchmark may be misleading: you might be ending up in a pattern where you allocate, run a trivial GC pass, and then free, whereas a more realistic workload wouldn't show this behavior.
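A rough way to probe this (a sketch; the sizes are arbitrary and the exact crossover point is an implementation detail) is to scan allocation sizes and watch the per-element cost:

```julia
using BenchmarkTools

# Scan a range of sizes and print the minimum time per element. A jump
# in the trend would suggest a change of allocation strategy (e.g. from
# a pooled allocator to malloc) somewhere between two sizes.
for n in (2^k for k in 7:2:17)
    t = @benchmark Vector{Int}(undef, $n)
    println(lpad(n, 7), "  ", round(minimum(t).time / n, digits = 3), " ns/element")
end
```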


It’s useful to know about the OS dependence, thanks for checking. Interestingly, allocs estimate = 3 in both of my cases, whereas for you it’s 1 and 2 for the smaller and larger case, respectively.

The view solution is indeed faster than direct allocation for 1000, and almost as fast as for 10,000. Also, it’s about 7 times faster than resize!, so I’ll try it elsewhere in my code.

EDIT: Actually, the resize! slowdown could be a regression from Julia 1.9.4 → 1.11.1.

x = Vector{Int}(undef, 10000)
y = Vector{Int}(undef, 10000)
display(@benchmark view($x, 1:1000))
display(@benchmark resize!($y, 1000))
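One caveat with the view approach, though (a small sketch of the semantic difference): a view only narrows the index range, the parent vector keeps all of its storage alive, and a SubArray cannot grow, whereas resize! changes the vector itself.

```julia
x = Vector{Int}(undef, 10_000)
y = view(x, 1:1_000)   # cheap wrapper; x still owns all 10_000 slots
length(y)              # 1_000
length(parent(y))      # 10_000: the full buffer stays reachable via y
# push!(y, 0)          # would throw: SubArrays have a fixed size
resize!(x, 1_000)      # by contrast, this shrinks x itself
length(x)              # 1_000
```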

On 1.11.1 this outputs

view:
Range (min … max):  2.416 ns … 15.834 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.542 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.544 ns ±  0.168 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
resize!:
 Range (min … max):  3.958 ns … 17.916 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.042 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.085 ns ±  0.304 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

And on 1.9.4:

view: 
Range (min … max):  1.791 ns … 26.500 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.875 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.910 ns ±  0.360 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
resize!:
 Range (min … max):  1.791 ns … 24.166 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.875 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.915 ns ±  0.405 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

Thanks, I’ll benchmark this with my actual (realistic) code. Do you have some pointers on which OS/hardware/GC parameters could be relevant here? Perhaps I could adjust vector sizes programmatically based on those.

This benchmark result is very weird and probably misleading. resize! should also be basically a zero-cost operation. How can it be slower than, well, anything?

Note that this is only the case on 1.10.4; on 1.11.1 I also got 3 allocations for both sizes.

This doesn’t seem to be true in 1.11: resize! essentially just calls _deleteend!, which is implemented as

function _deleteend!(a::Vector, delta::Integer)
    delta = Int(delta)
    len = length(a)
    0 <= delta <= len || throw(ArgumentError("_deleteend! requires delta in 0:length(a)"))
    newlen = len - delta
    for i in newlen+1:len
        @inbounds _unsetindex!(a, i)
    end
    setfield!(a, :size, (newlen,))
    return
end

(while in earlier versions it’s a ccall). So apart from changing the size attribute, it is also explicitly looping over all ‘deleted’ elements.


That’s shocking. The whole point of resize! is that it should be zero cost (or O(1)). What is the point of deleting the elements?

The intent, at least, is that the entire loop should disappear for common eltypes. It needs to be there (and was there in the C implementation) because you don’t want deleted slots in the array keeping data alive; otherwise that would be a nasty source of memory leaks. For bitstypes it (hopefully) codegens into nothing.
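The "keeping data alive" point can be demonstrated with a finalizer (a sketch; `Payload` is a made-up type): shrinking a vector that holds the only reference to an object should make that object collectable, precisely because the loop clears the stale slot.

```julia
# Made-up example: after resize!-shrinking drops the only reference,
# the GC can reclaim the Payload; the finalizer tells us when it does.
mutable struct Payload
    data::Vector{Int}
end

collected = Ref(false)
v = [Payload(zeros(Int, 1_000))]
finalizer(_ -> (collected[] = true), v[1])
resize!(v, 0)        # the slot holding the Payload gets unset
GC.gc(); GC.gc()     # finalizers may need a pass or two to run
collected[]          # true once the finalizer has run
```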

But benchmarks (I tried some myself) showed O(n) behavior for Float64. Is that unexpected?
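For what it’s worth, a minimal way to check for O(n) shrinking (a sketch): time only the shrink, on a fresh vector for each evaluation, and vary the size. If the `_unsetindex!` loop really compiles away for `Float64`, the times should be flat in n; if not, they should grow roughly linearly.

```julia
using BenchmarkTools

# setup= builds a fresh vector per sample and evals=1 ensures each
# timed call shrinks a full-length vector exactly once.
for n in (10^4, 10^5, 10^6)
    t = @benchmark resize!(v, 0) setup = (v = Vector{Float64}(undef, $n)) evals = 1
    println(n, ": ", minimum(t).time, " ns")
end
```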

Windows 11, Julia v1.11.1, and I also see the consistently shorter minimum, median, and average times for allocating the smaller vector.

I got a somewhat more reasonable result on Julia 1.9.4. Running this code

function allocation(a::Int=1000, b::Int=10_000)
    println("Allocation")
    println(a)
    display(@benchmark Vector{Int}(undef, $a))
    println(b)
    display(@benchmark Vector{Int}(undef, $b))
    println(a)
    display(@benchmark Vector{Int}(undef, $a))
    println(b)
    display(@benchmark Vector{Int}(undef, $b))
    println("---------------------------")
end
allocation()

after a few runs outputs


The median time is smaller for the smaller vector, whereas the average is still larger. However, for Julia 1.11.1, the outputs are the same as in my OP, even after several runs.