Why does a vector with 10 times more elements take 2x-5x less time to pre-allocate?

I'm benchmarking pre-allocation of vectors:

display(@benchmark Vector{Int}(undef, 1000))
display(@benchmark Vector{Int}(undef, 10000))
display(@benchmark Vector{Int}(undef, 1000))
display(@benchmark Vector{Int}(undef, 10000))

I get these results on my MacBook Pro 2021, Ventura 13.6.2, with Julia 1.11.1:


This is counterintuitive: pre-allocating the longer vector takes about 2x less time on average and 5x less at the median.

Why would this happen? And how can this be used for faster code?

I cannot replicate this (on Windows 10), so it's certainly not a universal phenomenon.

Versioninfo (1.10.4)

Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_NUM_THREADS = auto

1.10.4
julia> @benchmark Vector{Int}(undef, 1000)
BenchmarkTools.Trial: 10000 samples with 972 evaluations.
 Range (min … max):  122.634 ns …  38.501 μs  ┊ GC (min … max):  0.00% … 97.11%
 Time  (median):     175.823 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   469.906 ns ± 736.463 ns  ┊ GC (mean ± σ):  36.84% ± 27.70%

  █▅▃        ▁▃▃▃▃▂▃▃▂▁▁                                        ▁
  ████▆▆▃▃▃▁▅████████████▇▆▆▆▆▆▄▄▅▄▃▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▄▄▄▄▅▅▄▄▄▇ █
  123 ns        Histogram: log(frequency) by time       3.93 μs <

 Memory estimate: 7.94 KiB, allocs estimate: 1.

julia> @benchmark Vector{Int}(undef, 10000)
BenchmarkTools.Trial: 10000 samples with 196 evaluations.
 Range (min … max):  611.735 ns … 13.318 μs  ┊ GC (min … max):  0.00% … 88.09%
 Time  (median):     878.571 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.689 μs ±  1.234 μs  ┊ GC (mean ± σ):  40.63% ± 29.03%

  ▅▆▇█▆▂▅▂▂▁           ▁        ▂▃▂▂▂▂▃▅▆▆▅▄▃▂▁                ▂
  ███████████▆▆▅▃▁▃▁▁▅▅██▇▅▆▆▅▆███████████████████▇▇▆▇▇▆▆▆▅▇▅▅ █
  612 ns        Histogram: log(frequency) by time      4.57 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.
1.11.1
julia> @benchmark Vector{Int}(undef, 1000)
BenchmarkTools.Trial: 10000 samples with 964 evaluations.
 Range (min … max):  136.826 ns …   6.270 μs  ┊ GC (min … max):  0.00% … 83.47%
 Time  (median):     164.627 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   438.752 ns ± 484.669 ns  ┊ GC (mean ± σ):  43.07% ± 33.43%

  █▃▂▁      ▃▅▄▂▁▁▁            ▁▁▁                              ▁
  █████▆▅▅▆▇█████████▇▇▇▇▆▆▇▇▇█████▇▇▆▅▆▆▅▆▆▅▄▅▅▅▅▄▄▆▅▆▆▆▅▅▅▅▅▄ █
  137 ns        Histogram: log(frequency) by time       2.45 μs <

 Memory estimate: 7.88 KiB, allocs estimate: 3.

julia> @benchmark Vector{Int}(undef, 10000)
BenchmarkTools.Trial: 10000 samples with 178 evaluations.
 Range (min … max):  783.146 ns … 13.634 μs  ┊ GC (min … max):  0.00% … 74.92%
 Time  (median):       1.163 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):     2.383 μs ±  1.610 μs  ┊ GC (mean ± σ):  54.82% ± 36.67%

  ▅▇▇█▆▂▁                       ▂▆▆▆▆▅▄▃▂▁▁▁  ▁                ▂
  █████████▇▄▁▅▅▄▄██▆▆▅▄▄▁▁▁▄▁▄▆████████████████▇▇▆▇█▇▆▆▆▆▆▅▅ █
  783 ns        Histogram: log(frequency) by time      6.38 μs <

 Memory estimate: 78.19 KiB, allocs estimate: 3.

I suppose you could just allocate too much and then take a view.

x = Vector{Int}(undef, alloc_length)  # e.g. 10_000
x = view(x, 1:desired_length)  # e.g. 1_000
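One caveat worth noting (a quick sketch of my own, not guaranteed to match your use case): the view keeps the entire over-allocated buffer reachable, so this trades memory for allocation speed.

```julia
# Sketch: a view over an over-allocated buffer still references the
# whole parent array, so the unused tail cannot be garbage-collected.
x = Vector{Int}(undef, 10_000)
xv = view(x, 1:1_000)

length(xv)          # 1000 — what downstream code sees
length(parent(xv))  # 10000 — the full buffer stays alive
```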

This has to do with the speed of malloc vs. your system allocator. Note that this benchmark may be misleading, since you might be ending up in a place where you are allocating, running a trivial GC, and then freeing, where a more realistic workload wouldn't show this behavior.
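One way to probe this point (a rough sketch using only Base, no BenchmarkTools; the function name is mine) is to take the minimum over many hand-timed allocations, since GC pauses only hit some iterations:

```julia
# Rough sketch: minimum hand-timed allocation cost in nanoseconds.
# Taking the minimum over many repetitions largely filters out GC
# pauses, which only occur on some iterations.
function min_alloc_time(n; reps = 10_000)
    best = typemax(UInt64)
    for _ in 1:reps
        t0 = time_ns()
        v = Vector{Int}(undef, n)
        v[1] = 0                      # touch the array so it is really allocated
        best = min(best, time_ns() - t0)
    end
    return best
end

GC.gc()  # start from a clean heap
min_alloc_time(1_000), min_alloc_time(10_000)
```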


It's useful to know about the OS dependence, thanks for checking. Interestingly, the allocs estimate is 3 in both of my cases, whereas for you it's 1 and 2 for the smaller and larger case, respectively.

The view solution is indeed faster than direct allocation for 1000 elements, and almost as fast as direct allocation for 10,000. Also, it's about 7 times faster than resize!, so I'll try it elsewhere in my code.

EDIT: Actually, the resize! speed could be a regression from Julia 1.9.4 → 1.11.1.

x = Vector{Int}(undef, 10000)
y = Vector{Int}(undef, 10000)
display(@benchmark view($x, 1:1000))
display(@benchmark resize!($y, 1000))

On 1.11.1 this outputs

view:
 Range (min … max):  2.416 ns … 15.834 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.542 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.544 ns ±  0.168 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
resize!:
 Range (min … max):  3.958 ns … 17.916 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.042 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.085 ns ±  0.304 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

And on 1.9.4:

view:
 Range (min … max):  1.791 ns … 26.500 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.875 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.910 ns ±  0.360 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
resize!:
 Range (min … max):  1.791 ns … 24.166 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.875 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.915 ns ±  0.405 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

Thanks, I'll benchmark this with my actual (realistic) code. Do you have some pointers to which OS/hardware/GC parameters could be relevant here? Perhaps I could adjust vector sizes programmatically, based on those.

This benchmark result is very weird and probably misleading. resize! should also be basically a zero-cost operation. How can it be slower than, well, anything?

Note that this is only the case on 1.10.4; on 1.11.1 I also got 3 allocations for both sizes.

This doesn't seem to be true in 1.11: resize! essentially just calls _deleteend!, which is implemented as

function _deleteend!(a::Vector, delta::Integer)
    delta = Int(delta)
    len = length(a)
    0 <= delta <= len || throw(ArgumentError("_deleteend! requires delta in 0:length(a)"))
    newlen = len - delta
    for i in newlen+1:len
        @inbounds _unsetindex!(a, i)
    end
    setfield!(a, :size, (newlen,))
    return
end

(while in earlier versions it was a ccall). So apart from changing the size attribute, it also explicitly loops over all 'deleted' elements.


That's shocking. The whole point of resize! is that it should be zero-cost (or O(1)). What is the point of deleting the elements?

The intent, at least, is that the entire loop should disappear for common eltypes. It needs to be there (and was there in the C version) because you don't want deleted slots in the array keeping data alive; otherwise that would be a nasty source of memory leaks. For bitstypes it (hopefully) codegens into nothing.
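A quick way to check whether the unset loop actually vanishes for a bitstype (a sketch of mine, Base only; results may well differ across Julia versions):

```julia
# Sketch: time resize!-shrinking for a bitstype eltype. If the
# _unsetindex! loop compiles away, the cost should not grow with the
# number of removed elements.
function shrink_ns(n)
    v = Vector{Float64}(undef, n)
    t0 = time_ns()
    resize!(v, 1)
    return time_ns() - t0
end

shrink_ns(10_000), shrink_ns(1_000_000)  # compare the two timings
```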


But benchmarks (I tried some myself) showed O(n) behavior for Float64. Is that unexpected?


Windows 11, Julia v1.11.1, and I also see the consistently shorter minimum, median, and average times for allocating the smaller vector.

I got a little more reasonable result for Julia 1.9.4. Running this code

function allocation(a::Int=1000, b::Int=10_000)
    println("Allocation")
    println(a)
    display(@benchmark Vector{Int}(undef, $a))
    println(b)
    display(@benchmark Vector{Int}(undef, $b))
    println(a)
    display(@benchmark Vector{Int}(undef, $a))
    println(b)
    display(@benchmark Vector{Int}(undef, $b))
    println("---------------------------")
end
allocation()

after a few runs outputs


The median time is smaller for the smaller vector, whereas the average is still larger. However, for Julia 1.11.1 the output is the same as in my OP, even after several runs.

It definitely shouldn't be. Time to dig into some profiling.

Vector{Int}(undef, 1000) is basically only allocating on 1.10 (calling jl_alloc_array_1d and thus malloc), and since it does not initialize the memory, the speed should be independent of size(?):

julia> @code_lowered Vector{Int}(undef, 1000)
CodeInfo(
1 ─ %1 = Core.cconvert(Core.Int, m)
│   %2 = Core.apply_type(Core.Array, $(Expr(:static_parameter, 1)), 1)
│   %3 = Core.unsafe_convert(Core.Int, %1)
│   %4 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{T, 1}, svec(Any, Int64), 0, :(:ccall), :(%2), :(%3), :(%1)))
└──      return %4
)

I believe the benchmarking is inaccurate: the allocation itself isn't directly responsible for the GC activity (i.e. if you're not running out of memory, GC shouldn't be triggered; it happens because of benchmarking in a loop). GC is in fact shown as zero for the minimum, but sometimes appears for the mean (and sometimes for the median, and sometimes there is no GC activity at all), so I think it reflects the cost of releasing the memory, and thus GC activity.

The min stays almost the same when allocating 10x as much (276.351 ns for me), while some of the other numbers go up, and they go up again with another 10x at 100x the allocation.

However, on 1.11 I see larger assembly with @code_native and different/larger lowered code:

julia> @code_lowered Vector{Int}(undef, 1000)
CodeInfo(
1 ─ %1 = Core.fieldtype
│   %2 = Core.fieldtype(self, :ref)
│   %3 = (%1)(%2, :mem)
│   %4 = Core.undef
│        mem = (%3)(%4, m)
│   %6 = mem
│   %7 = Core.memoryref(%6)
│   %8 = Core.tuple(m)
│   %9 = %new(self, %7, %8)
└──      return %9
)

I think/thought malloc in general does not initialize:

I still think that on Windows it wouldn't initialize, but one caveat is that first allocations need to come from the kernel (basically old memory from other processes), and on any OS that memory must then be initialized for security reasons.

I would thus trust the min numbers when only allocating, i.e. when using undef (this does not apply to e.g. zeros, which does fill the array and is then of course linear in size).
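For contrast, a small sketch (mine, with made-up helper names) of why zeros is different: it must write every element, so its cost necessarily scales with the length, unlike a bare undef allocation:

```julia
# Sketch: undef only allocates, while zeros also fills the buffer,
# making the latter necessarily O(n) in the vector length.
alloc_only(n) = Vector{Int}(undef, n)  # uninitialized contents
alloc_fill(n) = zeros(Int, n)          # every element written to 0

v = alloc_fill(5)
all(iszero, v)         # true — fully initialized
length(alloc_only(5))  # 5 — allocated, contents arbitrary
```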


That's very insightful, thanks. And CodeInfo is a great tool, which I'll use.

For accurate benchmarking in my case, would you suggest (A) benchmarking code different from my original, or (B) using the same code but only looking at the minimum time?

If (B) is the way, does it mean that Julia's standard benchmarking can be inaccurate due to GC activity? Then there could be implications for lots of benchmarking code.