Why does a vector with 10 times more elements take 2x-5x less time to pre-allocate?

I'm benchmarking pre-allocation of vectors:

display(@benchmark Vector{Int}(undef, 1000))
display(@benchmark Vector{Int}(undef, 10000))
display(@benchmark Vector{Int}(undef, 1000))
display(@benchmark Vector{Int}(undef, 10000))

I get these results on my MacBook Pro 2021, Ventura 13.6.2, with Julia 1.11.1:


This is counterintuitive: pre-allocating the longer vector takes about 2x less time on average and 5x less at the median.

Why would this happen? And how can this be used for faster code?

I cannot replicate this (on Windows 10), so it's certainly not a universal phenomenon.

Versioninfo (1.10.4)

Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_NUM_THREADS = auto

1.10.4
julia> @benchmark Vector{Int}(undef, 1000)
BenchmarkTools.Trial: 10000 samples with 972 evaluations.
 Range (min … max):  122.634 ns …  38.501 μs  ┊ GC (min … max):  0.00% … 97.11%
 Time  (median):     175.823 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   469.906 ns ± 736.463 ns  ┊ GC (mean ± σ):  36.84% ± 27.70%

  █▅▃        ▁▃▃▃▃▂▃▃▂▁▁                                        ▁
  ████▆▆▃▃▃▁▅████████████▇▆▆▆▆▆▄▄▅▄▃▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▄▄▄▄▅▅▄▄▄▇ █
  123 ns        Histogram: log(frequency) by time       3.93 μs <

 Memory estimate: 7.94 KiB, allocs estimate: 1.

julia> @benchmark Vector{Int}(undef, 10000)
BenchmarkTools.Trial: 10000 samples with 196 evaluations.
 Range (min … max):  611.735 ns … 13.318 μs  ┊ GC (min … max):  0.00% … 88.09%
 Time  (median):     878.571 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.689 μs ±  1.234 μs  ┊ GC (mean ± σ):  40.63% ± 29.03%

  ▅▆▇█▆▂▅▂▂▁           ▁        ▂▃▂▂▂▂▃▅▆▆▅▄▃▂▁                ▂
  ███████████▆▆▅▃▁▃▁▁▅▅██▇▅▆▆▅▆███████████████████▇▇▆▇▇▆▆▆▅▇▅▅ █
  612 ns        Histogram: log(frequency) by time      4.57 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.
1.11.1
julia> @benchmark Vector{Int}(undef, 1000)
BenchmarkTools.Trial: 10000 samples with 964 evaluations.
 Range (min … max):  136.826 ns …   6.270 μs  ┊ GC (min … max):  0.00% … 83.47%
 Time  (median):     164.627 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   438.752 ns ± 484.669 ns  ┊ GC (mean ± σ):  43.07% ± 33.43%

  █▃▂▁      ▃▅▄▂▁▁▁            ▁▁▁                              ▁
  █████▆▅▅▆▇█████████▇▇▇▇▆▆▇▇▇█████▇▇▆▅▆▆▅▆▆▅▄▅▅▅▅▄▄▆▅▆▆▆▅▅▅▅▅▄ █
  137 ns        Histogram: log(frequency) by time       2.45 μs <

 Memory estimate: 7.88 KiB, allocs estimate: 3.

julia> @benchmark Vector{Int}(undef, 10000)
BenchmarkTools.Trial: 10000 samples with 178 evaluations.
 Range (min … max):  783.146 ns … 13.634 μs  ┊ GC (min … max):  0.00% … 74.92%
 Time  (median):       1.163 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):     2.383 μs ±  1.610 μs  ┊ GC (mean ± σ):  54.82% ± 36.67%

  ▅▇▇█▆▂▁                       ▂▆▆▆▆▅▄▃▂▁▁▁  ▁                ▂
  █████████▇▄▁▅▅▄▄██▆▆▅▄▄▁▁▁▄▁▄▆████████████████▇▇▆▇█▇▆▆▆▆▆▅▅ █
  783 ns        Histogram: log(frequency) by time      6.38 μs <

 Memory estimate: 78.19 KiB, allocs estimate: 3.

I suppose you could just allocate too much and then take a view.

x = Vector{Int}(undef, alloc_length)  # e.g. 10_000
x = view(x, 1:desired_length)  # e.g. 1_000
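One caveat worth noting (a quick sketch of my own, not guaranteed to match your use case): the view keeps the entire over-allocated buffer reachable, so this trades memory for allocation speed.

```julia
# Sketch: a view over an over-allocated buffer still references the
# whole parent array, so the unused tail cannot be garbage-collected.
x = Vector{Int}(undef, 10_000)
xv = view(x, 1:1_000)

length(xv)          # 1000 — what downstream code sees
length(parent(xv))  # 10000 — the full buffer stays alive
```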

This has to do with the speed of malloc vs. your system allocator. Note that this benchmark may be misleading, since you might be ending up in a place where you are allocating, running a trivial GC, and then freeing, where a more realistic workload wouldn't show this behavior.
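One way to probe this point (a rough sketch using only Base, no BenchmarkTools; the function name is mine) is to take the minimum over many hand-timed allocations, since GC pauses only hit some iterations:

```julia
# Rough sketch: minimum hand-timed allocation cost in nanoseconds.
# Taking the minimum over many repetitions largely filters out GC
# pauses, which only occur on some iterations.
function min_alloc_time(n; reps = 10_000)
    best = typemax(UInt64)
    for _ in 1:reps
        t0 = time_ns()
        v = Vector{Int}(undef, n)
        v[1] = 0                      # touch the array so it is really allocated
        best = min(best, time_ns() - t0)
    end
    return best
end

GC.gc()  # start from a clean heap
min_alloc_time(1_000), min_alloc_time(10_000)
```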


It's useful to know about the OS dependence, thanks for checking. Interestingly, the allocs estimate is 3 in both of my cases, whereas for you it's 1 and 2 for the smaller and larger case, respectively.

The view solution is indeed faster than direct allocation for 1000 elements, and almost as fast as direct allocation for 10,000. Also, it's about 7 times faster than resize!, so I'll try it elsewhere in my code.

EDIT: Actually, the resize! speed could be a regression from Julia 1.9.4 → 1.11.1.

x = Vector{Int}(undef, 10000)
y = Vector{Int}(undef, 10000)
display(@benchmark view($x, 1:1000))
display(@benchmark resize!($y, 1000))

On 1.11.1 this outputs

view:
 Range (min … max):  2.416 ns … 15.834 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.542 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.544 ns ±  0.168 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
resize!:
 Range (min … max):  3.958 ns … 17.916 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.042 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.085 ns ±  0.304 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

And on 1.9.4:

view:
 Range (min … max):  1.791 ns … 26.500 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.875 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.910 ns ±  0.360 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
resize!:
 Range (min … max):  1.791 ns … 24.166 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.875 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.915 ns ±  0.405 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

Thanks, I'll benchmark this with my actual (realistic) code. Do you have some pointers to which OS/hardware/GC parameters could be relevant here? Perhaps I could adjust vector sizes programmatically, based on those.

This benchmark result is very weird and probably misleading. resize! should also be basically a zero-cost operation. How can it be slower than, well, anything?

Note that this is only the case on 1.10.4; on 1.11.1 I also got 3 allocations for both sizes.

This doesn't seem to be true in 1.11: resize! essentially just calls _deleteend!, which is implemented as

function _deleteend!(a::Vector, delta::Integer)
    delta = Int(delta)
    len = length(a)
    0 <= delta <= len || throw(ArgumentError("_deleteend! requires delta in 0:length(a)"))
    newlen = len - delta
    for i in newlen+1:len
        @inbounds _unsetindex!(a, i)
    end
    setfield!(a, :size, (newlen,))
    return
end

(while in earlier versions it was a ccall). So apart from changing the size attribute, it also explicitly loops over all 'deleted' elements.


That's shocking. The whole point of resize! is that it should be zero-cost (or O(1)). What is the point of deleting the elements?

The intent, at least, is that the entire loop should disappear for common eltypes. It needs to be there (and was there in the C version) because you don't want deleted slots in the array keeping data alive; otherwise that would be a nasty source of memory leaks. For bitstypes it (hopefully) codegens into nothing.
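A quick way to check whether the unset loop actually vanishes for a bitstype (a sketch of mine, Base only; results may well differ across Julia versions):

```julia
# Sketch: time resize!-shrinking for a bitstype eltype. If the
# _unsetindex! loop compiles away, the cost should not grow with the
# number of removed elements.
function shrink_ns(n)
    v = Vector{Float64}(undef, n)
    t0 = time_ns()
    resize!(v, 1)
    return time_ns() - t0
end

shrink_ns(10_000), shrink_ns(1_000_000)  # compare the two timings
```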


But benchmarks (I tried some myself) showed O(n) behavior for Float64. Is that unexpected?


Windows 11, Julia v1.11.1, and I also see the consistently shorter minimum, median, and average times for allocating the smaller vector.

I got a little more reasonable result for Julia 1.9.4. Running this code

function allocation(a::Int=1000, b::Int=10_000)
    println("Allocation")
    println(a)
    display(@benchmark Vector{Int}(undef, $a))
    println(b)
    display(@benchmark Vector{Int}(undef, $b))
    println(a)
    display(@benchmark Vector{Int}(undef, $a))
    println(b)
    display(@benchmark Vector{Int}(undef, $b))
    println("---------------------------")
end
allocation()

after a few runs outputs


The median time is smaller for the smaller vector, whereas the average is still larger. However, for Julia 1.11.1 the output is the same as in my OP, even after several runs.

It definitely shouldn't be. Time to dig into some profiling.

Vector{Int}(undef, 1000) is basically only allocating on 1.10 (calling jl_alloc_array_1d and thus malloc), and since it does not initialize the memory, the speed should be independent of size(?):

julia> @code_lowered Vector{Int}(undef, 1000)
CodeInfo(
1 ─ %1 = Core.cconvert(Core.Int, m)
│   %2 = Core.apply_type(Core.Array, $(Expr(:static_parameter, 1)), 1)
│   %3 = Core.unsafe_convert(Core.Int, %1)
│   %4 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{T, 1}, svec(Any, Int64), 0, :(:ccall), :(%2), :(%3), :(%1)))
└──      return %4
)

I believe the benchmarking is inaccurate: the allocation itself isn't directly responsible for the GC activity (i.e. if you're not running out of memory, GC shouldn't be triggered; it happens because of benchmarking in a loop). GC is in fact shown as zero for the minimum, but sometimes appears for the mean (and sometimes for the median, and sometimes there is no GC activity at all), so I think it reflects the cost of releasing the memory, and thus GC activity.

The min stays almost the same when allocating 10x as much (276.351 ns for me), while some of the other numbers go up, and they go up again with another 10x at 100x the allocation.

However, on 1.11 I see larger assembly with @code_native and different/larger lowered code:

julia> @code_lowered Vector{Int}(undef, 1000)
CodeInfo(
1 ─ %1 = Core.fieldtype
│   %2 = Core.fieldtype(self, :ref)
│   %3 = (%1)(%2, :mem)
│   %4 = Core.undef
│        mem = (%3)(%4, m)
│   %6 = mem
│   %7 = Core.memoryref(%6)
│   %8 = Core.tuple(m)
│   %9 = %new(self, %7, %8)
└──      return %9
)

I think/thought malloc in general does not initialize:

I still think that on Windows it wouldn't initialize, but one caveat is that first allocations need to come from the kernel (basically old memory from other processes), and on any OS that memory must then be initialized for security reasons.

I would thus trust the min numbers when only allocating, i.e. when using undef (this does not apply to e.g. zeros, which does fill the array and is then of course linear in size).
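For contrast, a small sketch (mine, with made-up helper names) of why zeros is different: it must write every element, so its cost necessarily scales with the length, unlike a bare undef allocation:

```julia
# Sketch: undef only allocates, while zeros also fills the buffer,
# making the latter necessarily O(n) in the vector length.
alloc_only(n) = Vector{Int}(undef, n)  # uninitialized contents
alloc_fill(n) = zeros(Int, n)          # every element written to 0

v = alloc_fill(5)
all(iszero, v)         # true — fully initialized
length(alloc_only(5))  # 5 — allocated, contents arbitrary
```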


That's very insightful, thanks. And CodeInfo is a great tool, which I'll use.

For accurate benchmarking in my case, would you suggest (A) benchmarking code different from my original, or (B) using the same code but only looking at the minimum time?

If (B) is the way, does it mean that Julia's standard benchmarking can be inaccurate due to GC activity? Then there could be implications for lots of benchmarking code.