push! will always have some performance burden if the reserved memory is not large enough. This is to be expected. Here, I am focusing on the case where the new values will fit into the already reserved memory.
With Julia 1.10 and earlier a ccall is necessary, which takes a bit of time, so push! had a performance disadvantage compared to an indexed assignment.
Now with Julia 1.11 (I am using alpha 2 for the following), push! should consist of a check whether the new value fits, an increase of the length and an indexed assignment. So it should be (at least nearly) as fast as if the user takes care of the check and the length. However, it isn’t:
using BenchmarkTools
function push()
    a = Int[]; sizehint!(a, 1000)
    @benchmark push_loop($a) setup=(empty!($a)) evals=1 samples=10^6
end
function push_loop(a)
    n = 0
    while n < 1000
        n += 1
        push!(a, n)
    end
end
function length_separate()
    a = Array{Int}(undef, 1000)
    @benchmark length_separate_loop($a) evals=1 samples=10^6
end
function length_separate_loop(a)
    length_separate = 0
    n = 0
    while n < 1000
        n += 1
        if n <= length(a.ref.mem)
            length_separate += 1
            a[length_separate] = n
        else
            error("out of bounds")
        end
    end
end
On my computer there is roughly one order of magnitude difference: median 635 ns for push versus 56 ns for the separate length.
The push! function is using _growend! which seems to be more generic than necessary and doing some stuff I do not fully understand, therefore I wanted to simplify the analysis of this difference, because the generic Array seems to be over-engineered for the simple vector use case where it can only grow towards the end. Thus I ended up with this version:
mutable struct SimpleMutableVector{T}
    length::Int
    mem::Memory{T}
end
function simple_mutable_vector()
    a = SimpleMutableVector{Int}(0, Memory{Int}(undef, 1000))
    @benchmark simple_mutable_vector_loop($a) evals=1 samples=10^6
end
function simple_mutable_vector_loop(a)
    a.length = 0
    n = 0
    while n < 1000
        n += 1
        if n <= length(a.mem)
            a.length += 1
            a.mem[a.length] = n
        else
            error("out of bounds")
        end
    end
    return a.mem[end]
end
But with a median of 689 ns this is in the same order of magnitude as the push! version and even a bit slower. However, we learn that we should mostly avoid simple mutable structs in performance critical code, so although having an immutable growing vector feels unnatural it’s worth a try:
struct SimpleVector{T}
    length::Int
    mem::Memory{T}
end
function simple_vector()
    a = SimpleVector{Int}(0, Memory{Int}(undef, 1000))
    @benchmark simple_vector_loop($a) evals=1 samples=10^6
end
function simple_vector_loop(a)
    a = SimpleVector{Int}(0, a.mem)
    n = 0
    while n < 1000
        n += 1
        if n <= length(a.mem)
            a = SimpleVector{Int}(a.length+1, a.mem)
            a.mem[a.length] = n
        else
            error("out of bounds")
        end
    end
    return a.mem[end]
end
This results in a median of 57 ns and is therefore (nearly) as fast as the array usage with separate length.
So what is going on here? Is the measurement of the timing flawed? Why is push! so slow compared to the indexed assignment? Why is the mutable version of SimpleVector so slow compared to the immutable version?
 After all, if another task
 After all, if another task