Performance of resize! vs pre-allocating with zeros(...)

I observe a factor of 5x time difference in writing over a vector with two different preallocation strategies.

  1. Using x = zeros(Int64, n)
  2. Using x = Int64[]; resize!(x,n)

The benchmark code is bellow

using BenchmarkTools

function test_access!(x,n)
  @inbounds for i = 1 : n
    x[i] = i
  end
end

function init(n)
  x = Int64[]
  resize!( x , n )
end

n = 10_000_000;

@btime test_access!(x,$n) setup=(x = zeros(Int64,$n))
  5.025 ms (0 allocations: 0 bytes)

@btime test_access!(x2,$n) setup=(x2 = init($n))
  26.381 ms (0 allocations: 0 bytes)

I was expecting these codes to have similar performance. Any ideas on why there is such a big difference?

I can confirm this. It is very odd.

zeros(Int64, n) does initialize the Vector with zeros, Int64[]; resize!(x,n) only ask for the memory (if you print it before overwriting you will see garbage). Probably just for the setup the resize! solution is faster.

The difference in access performance you are seeing probably comes from zeros having already materialized and brought the Vector to the L1 cache, while accessing the positions in the Vector expanded by resize will cause a page fault (i.e., the OS was lazily waiting for you to access any position to only then really allocate the memory), and then some cache misses of bringing the Vector slowly into the L1 cache.

6 Likes

Looks like @Henrique_Becker is right, the MWE

using BenchmarkTools

function test_access!(x,n)
  @inbounds for i = 1 : n
    x[i] = i
  end
end

function init1(n)
  x = Int64[]
  resize!( x , n )
  test_access!( x, n )
  return x
end

function init2(n)
    x = zeros(Int64, n)
    test_access!( x, n )
    return x
end

function init3(n)
    x = Vector{Int64}(undef, n)
    test_access!( x, n )
    return x
end

const n = 10_000_000;

@btime init1($n)
@btime init2($n)
@btime init3($n)

shows the expected

  13.780 ms (2 allocations: 76.29 MiB)
  18.666 ms (2 allocations: 76.29 MiB)
  15.528 ms (2 allocations: 76.29 MiB)
2 Likes

I agree with your explanation. One strategy is lazy allocation, the other is eager. In the lazy case, the pages are not allocated by the OS until needed, hence the penalty when visiting them for the first time.

Thanks!

1 Like

The fastest should be Vector{Int}(undef, n), because it doesn’t have to allocate anew and doesn’t have to call into the runtime again.

1 Like

Please refer to the following thread for an in-depth discussion about the (dis-)advantages of using calloc, as well as links to further discussions.

2 Likes

I believe Vector{Int}(undef, n) can only be the same or faster. However, if we look at the number of allocations I am not sure if x = Int[]; resize!(x, n) does not end up being the same; because, maybe, x = Int[] does not allocate a memory space in the heap until you start putting things inside it or call resize! (i.e., it strongly relies on the asymptotic O(1) behaviour of push!).

I started the registration process for GitHub - mkitti/ArrayAllocators.jl: Allocate arrays with malloc, calloc, or on NUMA nodes yesterday. If you wanted to explicitly initialize the array, you can use fill!. This uses the C call memset under the hood.

using ArrayAllocators

julia> @btime test_access!(x,$n) setup=(x = zeros(Int64,$n))
  8.457 ms (0 allocations: 0 bytes)

julia> @btime test_access!(x2,$n) setup=(x2 = init($n))
  17.474 ms (0 allocations: 0 bytes)

julia> @btime test_access!(x2,$n) setup=(x2 = init($n); fill!(x2,0))
  8.026 ms (0 allocations: 0 bytes)

julia> @btime test_access!(x2,$n) setup=(x2 = Array{Int64}(undef, $n))
  19.933 ms (0 allocations: 0 bytes)

julia> @btime test_access!(x2,$n) setup=(x2 = Array{Int64}(undef, $n); fill!(x2, 0))
  8.492 ms (0 allocations: 0 bytes)

julia> @btime test_access!(x2,$n) setup=(x2 = Array{Int64}(calloc, $n))
  16.426 ms (0 allocations: 0 bytes)

julia> @btime test_access!(x2,$n) setup=(x2 = Array{Int64}(calloc, $n); fill!(x2, 0))
  8.472 ms (0 allocations: 0 bytes)

1 Like

If I remember correctly, the minimum size that is always allocated is 32. The resize! has to do an additional call into the C runtime, after allocating the array. The undef version only does a single one, so you’ll always have at least that as a difference.