Why is push! still so slow?

push! will always have some performance burden if the reserved memory is not large enough. This is to be expected. Here, I am focusing on the case where the new values will fit into the already reserved memory.

With Julia 1.10 and earlier, a ccall was necessary, which takes a bit of time, so push! had a performance disadvantage compared to an indexed assignment.

Now with Julia 1.11 (I am using alpha 2 for the following), push! should consist of a check whether the new value fits, an increment of the length, and an indexed assignment. So it should be (at least nearly) as fast as if the user takes care of the check and the length themselves. However, it isn’t:

using BenchmarkTools

function push()
    a = Int[]; sizehint!(a, 1000)
    @benchmark push_loop($a) setup=(empty!($a)) evals=1 samples=10^6
end

function push_loop(a)
    n = 0
    while n < 1000
        n += 1
        push!(a, n)
    end
end
function length_separate()
    a = Array{Int}(undef, 1000)
    @benchmark length_separate_loop($a) evals=1 samples=10^6
end

function length_separate_loop(a)
    length_separate = 0
    n = 0
    while n < 1000
        n += 1
        if n <= length(a.ref.mem)
            length_separate += 1
            a[length_separate] = n
        else
            error("out of bounds")
        end
    end
end

On my computer there is roughly one order of magnitude difference: a median of 635 ns for the push! version versus 56 ns for the separate length.

The push! function uses _growend!, which seems more generic than necessary and does some things I do not fully understand. I wanted to simplify the analysis of this difference, because the generic Array seems over-engineered for the simple vector use case where it can only grow at the end. Thus I ended up with this version:

mutable struct SimpleMutableVector{T}
    length::Int
    mem::Memory{T}
end

function simple_mutable_vector()
    a = SimpleMutableVector{Int}(0, Memory{Int}(undef, 1000))
    @benchmark simple_mutable_vector_loop($a) evals=1 samples=10^6
end

function simple_mutable_vector_loop(a)
    a.length = 0
    n = 0
    while n < 1000
        n += 1
        if n <= length(a.mem)
            a.length += 1
            a.mem[a.length] = n
        else
            error("out of bounds")
        end
    end
    return a.mem[end]
end

But with a median of 689 ns this is in the same order of magnitude as the push! version, and even a bit slower. However, we learn that we should mostly avoid mutable structs in performance-critical code, so although an immutable growing vector feels unnatural, it’s worth a try:

struct SimpleVector{T}
    length::Int
    mem::Memory{T}
end
function simple_vector()
    a = SimpleVector{Int}(0, Memory{Int}(undef, 1000))
    @benchmark simple_vector_loop($a) evals=1 samples=10^6
end
function simple_vector_loop(a)
    a = SimpleVector{Int}(0, a.mem)
    n = 0
    while n < 1000
        n += 1
        if n <= length(a.mem)
            a = SimpleVector{Int}(a.length+1, a.mem)
            a.mem[a.length] = n
        else
            error("out of bounds")
        end
    end
    return a.mem[end]
end

This results in a median of 57 ns and is therefore (nearly) as fast as the array usage with a separate length.

So what is going on here? Is the measurement of the timing flawed? Why is push! so slow compared to the indexed assignment? Why is the mutable version of SimpleVector so slow compared to the immutable version?

15 Likes

To set a baseline, here’s the timing I’m getting (on a downloaded 1.11-alpha2):

julia> push() |> display; length_separate() |> display; simple_mutable_vector() |> display; simple_vector()
BenchmarkTools.Trial: 1000000 samples with 1 evaluation.
 Range (min … max):  459.000 ns … 49.070 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     470.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   471.027 ns ± 81.790 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                 █                              
  ▂▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▄ ▂
  459 ns          Histogram: frequency by time          480 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 1000000 samples with 1 evaluation.
 Range (min … max):  49.000 ns …  9.170 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     60.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   60.250 ns ± 21.828 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃                           █                           ▃ ▃
  ▃▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁█ █
  49 ns        Histogram: log(frequency) by time        70 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 1000000 samples with 1 evaluation.
 Range (min … max):  560.000 ns … 13.429 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     600.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   606.287 ns ± 71.233 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▄        ▁█        ▁█         ▃ ▄
  ▄▁▁▁▁▁▁▁▃▆▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▅█▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▅█ █
  560 ns        Histogram: log(frequency) by time       620 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 1000000 samples with 1 evaluation.
 Range (min … max):  50.000 ns …  8.160 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     60.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   62.667 ns ± 20.276 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               █                               
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▇ ▂
  50 ns           Histogram: frequency by time          70 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

which is consistent with what you’re seeing, if I’m not mistaken.

Now, why is there a difference? Let’s take a look at what the LLVM IR looks like for each of these (see this gist, since discourse doesn’t like extremely long posts)

What we can see is that for (2) & (4), there’s lots of instructions like these, indicating successful vectorization by a factor of 8:

; ┌ @ genericmemory.jl:213 within `setindex!`
   %24 = getelementptr inbounds i64, ptr %0, i64 %offset.idx
   store <8 x i64> %20, ptr %24, align 8
   %25 = getelementptr inbounds i64, ptr %24, i64 8
   store <8 x i64> %21, ptr %25, align 8
   %26 = getelementptr inbounds i64, ptr %24, i64 16
   store <8 x i64> %22, ptr %26, align 8
   %27 = getelementptr inbounds i64, ptr %24, i64 24
   store <8 x i64> %23, ptr %27, align 8

which isn’t the case for (1) and (3). I suspect this is because the constant length of a.mem in (3) can’t be exploited (the input argument is not constant for the benchmark, after all, so the Memory could change), and since a.length changes in every iteration, any one iteration could throw your error. Worse, since the input is mutable, it might even have changed between iterations: other tasks can very much push! to your argument while your code is running, which may result in a reallocation of the backing memory.

In contrast, for SimpleVector it’s much easier to prove that the erroring branch will never be taken: it’s always a different object that is accessed (though the Memory stays the same, since the input cannot be mutated and we always retrieve that same Memory, which can’t change size!). Since it’s impossible to construct an object that could make it throw (the accessed indices are bounded by a known constant), the check can be hoisted.

length_separate_loop is actually the most interesting one out of all of these, since I think this should not be as fast as it is :thinking: After all, if another task push!es to that same vector at the same time as your length_separate_loop runs and causes a new Memory to be allocated (thereby replacing the backing MemoryRef of that Vector), the access in a[length_separate] = n ought to throw an error if the new length is shorter than length_separate! In fact, this is (I believe) the reason why push! through _growend! is so complicated - it needs to ensure that the indexing is still valid, without race conditions.

4 Likes

It really doesn’t attempt to ensure that. If you look at the llvm code, there are no atomics involved, meaning that the compiler has full license to reorder.

I suspect that the culprit is the following:

Bounds checks appear to check against vec.size, instead of the more correct vec.ref.mem.length. Evidence:

julia> x=[1]; Base.setfield!(x, :size, (2,)); x
2-element Vector{Int64}:
               1
 127578559461152

This upgrades the race in terms of badness: It turns from “updates were lost, data corrupted” into “general memory corruption, security vuln if exposed” (same as the old variant).

Regardless, there is an invariant relating a vector’s size and the offset/length of the backing memory. I think LLVM doesn’t know this, and hence can’t optimize it.

I don’t think push! is thread safe, so it’s not that complicated. Otherwise I think you’re right. If the bounds checking is removed from simple_mutable_vector_loop, the loop is vectorized, just like the simple_vector_loop and length_separate_loop. So, it’s the throw semantics, i.e. it must throw at the right iteration, with the mutable struct modified up to that point.

If it can’t prove that it won’t throw, it can’t vectorize. (This is actually the official reason MS gave for not endorsing Java back in the day: it was too hard to make it JIT-compile well, though IBM did just that.) The push_loop is probably too complicated for the compiler to prove that it does not throw, with an anonymous function potentially doing the allocation of new memory in Base._growend!.
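For concreteness, here is a sketch of what "removing the bounds checking" could look like for the mutable-struct loop (my own illustration, not code from the thread; the variant name is made up, and it requires Julia 1.11 for Memory). With @inbounds, the compiler no longer has to preserve the "throw at exactly the right iteration" semantics:

```julia
# Same struct as in the original post (repeated here for completeness).
mutable struct SimpleMutableVector{T}
    length::Int
    mem::Memory{T}
end

# Variant of simple_mutable_vector_loop with the error branch removed and
# the indexing marked @inbounds: without the obligation to throw at the
# right iteration (with all earlier mutations visible), the loop can be
# vectorized like the immutable version.
function simple_mutable_vector_loop_inbounds!(a)
    a.length = 0
    n = 0
    while n < 1000
        n += 1
        a.length += 1
        @inbounds a.mem[a.length] = n
    end
    return a.mem[end]
end
```

As usual, @inbounds is only safe here because the caller guarantees the Memory holds at least 1000 elements.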

1 Like

Yes, I know that it doesn’t attempt to do that, but that doesn’t mean that the compiler has full license to reorder. The _growend! throws ConcurrencyViolationErrors when it detects that the MemoryRef has been switched out from under it, which you can’t realistically remove just from having access to a Vector because the .ref field is not const. I believe those erroring paths end up preventing vectorization.

That is not correct, nor is it evidence for your claim, because a Vector can have more (potentially uninitialized) backing memory available than it presents to the user, through e.g. sizehint!:

julia> a = Vector{Int}(undef, 10);

julia> size(a)
(10,)

julia> a.ref.mem |> length
10

julia> sizehint!(a, 1000);

julia> a.ref.mem |> length
1000

julia> size(a)
(10,)

The size stored in a Vector is the size available for indexing, while the length of the backing memory is its capacity.

Yes, and the important invariant is a.ref.ptr_or_offset + length(a)*sizeof(eltype(a)) <= a.ref.mem.ptr + a.ref.mem.length * sizeof(eltype(a)).

(reverted edit because I failed at using discourse)

Thanks for that analysis. Does that mean both push! (via _growend!) and the mutable-struct loop are too complicated for the auto-vectorizer to handle?

Is there any chance that the detection could be extended to cover this case? This would be a change outside of Julia, wouldn’t it?

A vectorized push! seems desirable as a sweet spot between user friendliness and performance, which is exactly the kind of trade-off Julia aims to address.

Thanks for the thorough analysis. That means that single pushes should not have the performance disadvantage, as no vectorization is possible then, correct?
That hypothesis is backed by the fact that if the loop limit is reduced to 1, all variants perform basically identically. However, I am not so sure whether that is really measuring anything useful or just overhead.

Even with that invariant, the code as written in the OP for length_separate shouldn’t vectorize because (without @inbounds) nothing establishes that the a.ref of the Vector doesn’t change while the loop is running. It might point to an entirely different region in memory at every iteration. The length of the array may have shrunk, together with its backing memory (e.g. another task does empty! followed by sizehint!(a, 0) on that same array).

There’s not too much that can be vectorized here in the first place; growing of the array can only happen once, so that’s not vectorizable. At most you could vectorize the writing of the actual data, continually checking against the available capacity, but that then runs afoul of the fact that the memory backing that Vector may change location at any moment. Nothing in your loops establishes to the compiler that the indexing will always be in bounds, which is why I’m saying that length_separate should also be slow here.

Use append! if you have a collection of elements that need appending to the same array, not individual push!. If you’re initializing the array, either give it a sizehint! so the backing memory doesn’t need to be copied around over and over again, or create the array with the correct size directly and write to the (at that point known-inbounds) indices through regular indexing.
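To illustrate those three recommendations, a quick sketch (the data and sizes here are arbitrary):

```julia
xs = collect(1:1000)            # some source data

# 1) append! a whole collection at once instead of pushing element-wise:
a = Int[]
append!(a, xs)

# 2) If you do push! in a loop, sizehint! reserves capacity up front so
#    the backing memory is not reallocated over and over while growing:
b = Int[]
sizehint!(b, length(xs))
for x in xs
    push!(b, x)
end

# 3) Or allocate the final size directly and write through regular,
#    at-that-point known-inbounds indexing:
c = Vector{Int}(undef, length(xs))
for (i, x) in pairs(xs)
    c[i] = x
end

# All three produce the same vector.
```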

Getting push! to vectorize is an up-hill battle against the semantics of arrays.

Looking at the native code, the bounds check in length_separate is hoisted out of the loop. The compiler can prove that it will not surprisingly throw; it’s sufficient to check once prior to the loop. It does not assume that a.ref can change concurrently. With the mutable struct it’s different. It can probably hoist the bounds check out, but it can’t throw before the loop because that would change the semantics: in the event of a bounds failure in the middle of the loop, it has already updated the mutable struct in previous iterations, and those changes should be visible in any catch up the call stack. They won’t be if the exception is thrown before the loop.

1 Like

One design that I think would be worth pursuing is

struct SimpleArray{N, T}
    size::NTuple{N,Int}
    mem::Memory{T}
end

and then you can have push functions that return a new SimpleArray without mutation (so you would do v = push(v, 1) rather than the current push!(v, 1)). This is obviously far too breaking a change to make in 1.x, but it would be possible for a package to do this sort of thing.
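For the 1-D case, a minimal sketch of such a non-mutating push might look like the following (the name push, the doubling growth factor, and the reallocation strategy are my own choices for illustration):

```julia
# Struct as proposed above, repeated here for completeness.
struct SimpleArray{N, T}
    size::NTuple{N, Int}
    mem::Memory{T}
end

# Non-mutating push for the 1-D case: returns a fresh immutable wrapper,
# reallocating the backing Memory only when capacity is exhausted.
function push(v::SimpleArray{1, T}, x) where {T}
    len = v.size[1]
    mem = v.mem
    if len == length(mem)                        # out of capacity
        newmem = Memory{T}(undef, max(4, 2 * length(mem)))
        copyto!(newmem, 1, mem, 1, len)          # copy existing elements
        mem = newmem
    end
    mem[len + 1] = x
    return SimpleArray{1, T}((len + 1,), mem)    # fresh immutable wrapper
end
```

Used as v = push(v, 1), mirroring the suggested API; since the wrapper is immutable, the hoisting argument from earlier in the thread should apply to loops over it as well.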

1 Like

My point is that it should assume that, since :ref isn’t constant:

julia> Base.isconst(Vector{Int}, :ref)
false

Since the ref can change, so can the underlying Memory, and thus hoisting the length check done on length(a.ref.mem) outside of the loop should be illegal.

2 Likes

Talking to myself here. Actually, the length_separate loop is split into two loops. It checks the bounds before the loop: if they are ok, it runs a vectorized loop; but if it determines that there will be a bounds failure, it runs a non-vectorized loop with bounds checking instead, to fill the vector up until it throws. With the mutable struct I guess it finds that trick too complicated, and runs a non-vectorized loop anyway.

That’s just the regular non-vectorized path that’s kept around for any cleanup of regions that don’t fit the vector width. e.g. If you have a vector of length 10 and you can use 8-wide SIMD, you can only use that SIMD for the first 8 elements and need to do the remaining 2 sequentially, otherwise you’d read/write out of bounds.

No, I don’t think so. There’s a cleanup loop right after the vector loop. It ends with a jump to the return code. In between there’s a serial loop which can throw, which is jumped to from the bounds check at the start of the function.

I think we’re talking past each other - can you link to the lines from the gist I posted above you’re referring to?

I’ve occasionally been working on certain RNG operations. I recall having a similar experience with mutable RNG states, which had me tinkering with immutable ones instead.

On line 22 it does a conditional jump to main.pseudo.exit, bypassing the vector loop. It continues to L6.postloop on line 120, then to L12.postloop on line 133 (or the error throw). Then either to L24 (which throws bounds error), or through to L27.postloop on line 145, which jumps back to L6.postloop from line 154.

The cleanup after the vector loop is in L6 (line 74) and L27 (which jumps back to L6), which never throws (i.e. never jumps to L24). (I looked at the native code first, which had this in a different order; hence my “in between” comment.)

I might be reading something wrong here, but isn’t that branch only taken when the array is empty? %isnotneg.inv is true when %.size.sroa.0.0.copyload is less than 0, and that can’t happen. So the branch in line 21 should always return whatever %.not30 is, and that’s only true when %6 is less than 1, i.e. when the array is empty.

Yes, you’re right. It runs the vectorized loop until it’s time for cleanup, either at the end of the loop, or before it’s time to throw the error (line 79). Neat.

So, why doesn’t it vectorize with the mutable struct? The mutable struct loop vectorizes if the error branch is removed, so it can’t be the mutable struct as such. And why should it try to handle concurrency problems? That’s the user’s responsibility?