Push! and insert! very slow? (julia 1.11) (EDIT: No; just one-time allocations.)

Hi all, I noticed that part of my code was running slowly, and then saw runtimes for push! and insert! on the order of µs. In contrast, other in-place methods such as deleteat! or setindex! are on the order of ns.

Are these runtimes expected? They seem slow to me. I’m on mac, julia 1.11.1.

using BenchmarkTools

@btime insert!(a, 59_999, 0) setup=(a=collect(1:60000)) evals=1
# 15.358 μs (3 allocations: 1.02 MiB)

@btime push!(a, 0) setup=(a=collect(1:60000)) evals=1
# 14.378 μs (2 allocations: 1.02 MiB)

@btime deleteat!(a, 59_999) setup=(a=collect(1:60000)) evals=1
# 57.000 ns (0 allocations: 0 bytes)

@btime setindex!(a, 0, 59_999) setup=(a=collect(1:60000)) evals=1
# 36.000 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 20 × Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 20 default, 0 interactive, 10 GC (on 20 virtual cores)
Environment:
  JULIA_EDITOR = code

Isn’t it just because they both allocate, whereas the other two don’t? In your benchmark, setting evals=1 essentially guarantees an allocation (and copy). If you don’t set evals=1, you indeed get much faster times.

2 Likes

When you make a Vector, Julia allocates a chunk of memory to hold its items. There is often some extra capacity allocated (more space than the current length of the vector), but not too much, so it doesn’t waste memory.

When you lengthen the Vector, if you exceed the size of the underlying allocation, Julia needs to request more space from the operating system, which is slow. But even if it doesn’t, it still needs to check the capacity, and that extra branch makes it slower.

In-place operations don’t have this overhead.
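To make that concrete, here is a minimal sketch that pokes at the 1.11 internals (a.ref.mem is not public API and may change between versions); it suggests that collect gives you a backing Memory with no spare capacity, so the first push! has to swap it for a larger one:

a = collect(1:60_000)
(length(a), length(a.ref.mem))   # (60000, 60000): no spare capacity after collect
push!(a, 0)                      # has to allocate a larger Memory and copy everything over
(length(a), length(a.ref.mem))   # now the backing Memory is larger than the Vector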

Edit: Also, because of the numerous branches in a push!, it can’t be vectorized by the CPU, so repeated push! operations are slower than setindex! for that reason too.

Indeed, the amortized cost seems to be much better, but I still don’t understand why the very first eval has to allocate.

julia> @benchmark push!(a, 0) setup=(a=collect(1:60000)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  27.228 μs … 455.899 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.450 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   98.648 μs ± 117.440 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▅▃                                 ▁ ▂▃▄▃▃▃▂▂▁             ▂
  █████▅▃▃▄▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▄▅▇▇██████████████████████▇███ █
  27.2 μs       Histogram: log(frequency) by time       407 μs <

 Memory estimate: 570.62 KiB, allocs estimate: 1.

julia> @benchmark push!(a, 0) setup=(a=collect(1:60000)) evals=100
BenchmarkTools.Trial: 10000 samples with 100 evaluations.
 Range (min … max):  293.990 ns … 4.222 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     355.125 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   928.474 ns ± 1.086 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇▅▂▁                                      ▃▃▃▃▃▂▂▂▁        ▂
  █████▇▆▆▅▅▄▃▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▆▇▇▇▇▅▆▆████████████▇▇▆▇▇ █
  294 ns       Histogram: log(frequency) by time      3.61 μs <

 Memory estimate: 5.71 KiB, allocs estimate: 0.

julia> @benchmark push!(a, 0) setup=(a=collect(1:60000)) evals=10000
BenchmarkTools.Trial: 10000 samples with 10000 evaluations.
 Range (min … max):  11.784 ns … 83.091 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.178 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.391 ns ± 14.431 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▇▅▃▂▁▁▂▃▃▂▁                 ▂▃▄▄▃▃▃▃▂▁                    ▂
  █████████████▇▇▇▆▇▇▆▇▆▅▆▆███▇▇████████████▇▆▇▆▆▅▇▆▅▇▆▇▆▆▇▆▆ █
  11.8 ns      Histogram: log(frequency) by time      67.2 ns <

 Memory estimate: 58 bytes, allocs estimate: 0.

I used the setup keyword precisely so that allocating the input array a wouldn’t be counted. At least I thought that’s the purpose of setup=; maybe I’m wrong.

Also evals=1 is on purpose, because this is closest to my actual code I’m interested in.

The first call has to allocate because the underlying memory is exactly the same size as the array, so Julia has to get a new, larger place to store the data (and copy the array over to the new place).

2 Likes

My guess is that for large arrays collect allocates the exact size, so any size-extending operation needs to allocate. If you know you are going to extend the array, you should allocate a good size from the start.

4 Likes

As I understand it, the push! call makes the array grow, and the first time you do that it needs to allocate (and copy all the old values to the new location). Julia is smart enough to ask for more memory than you need, so that all this copying doesn’t happen too often (the total overhead is amortized linear time). But that only helps if you push! repeatedly onto the same vector; the first push! is indeed relatively costly.

1 Like

If you can anticipate that you’re gonna do exactly one push!, perhaps you could allocate the array with size n+1 directly?

4 Likes

ok thanks all!

The code I’m implementing effectively performs a symmetric random walk on the array length. It does balanced insert!s of one element and deleteat!s of one element, so the array length stays at 60_000 on average. But individual simulations of course cross that level regularly and grow the array to a length greater than 60_000.

Maybe this repetitive “hitting” of the array length bound is my problem… I will try with a sizehint!.

Just sizehint! to 60000 + T, where T is the number of steps, and you’ll be deterministically safe. If T is small enough this will work perfectly; if T is much larger than 60000, you can be less conservative and use something like the standard deviation of the random walk.
EDIT: I wonder if deleteat! might undo the effects of sizehint! at some point.
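A minimal sketch of that suggestion (T = 1_000 below is just a made-up number of steps):

T = 1_000                    # number of random-walk steps (made-up value)
a = collect(1:60_000)
sizehint!(a, 60_000 + T)     # reserve the worst-case capacity once, up front
# later push!/insert! calls should then never need to grow the backing memory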

3 Likes

OK, I’m seeing an effect of sizehint! on the overall allocations, but hardly any on overall runtime. Possibly the code is already at its optimum, or there is something else going on. But thanks for the moment!!

EDIT: And sorry for the noisy title (I’ll add a “no”); I now have Julia 1.10 installed in parallel, to test for regressions before asking :smiley:

3 Likes

Why does insert! have one more allocation than push!? It has 3 (or zero) on 1.11, vs. 1 on 1.10.

It’s not inherently more complex than push! (when inserting near the end), i.e. it shouldn’t need an additional allocation.

push! takes 1 or 2 allocations on 1.10.6 (vs 2 on 1.11):

julia> a=collect(1:60000);

julia> @time push!(a, 0);
  0.000573 seconds (1 allocation: 570.625 KiB)

julia> a=collect(1:60000);

julia> @time push!(a, 0);
  0.000172 seconds (2 allocations: 1.015 MiB)

On 1.11 it seems to always be 2 allocations with intermediate timing (or, when the array has already been enlarged, 0 allocations and very quick):

julia> @time push!(a, 0);
  0.000385 seconds (2 allocations: 1.015 MiB)

You might think you’re deleting at the end, but you’re not: with 1-based indexing the last index is 60_000, not 59_999. When you don’t delete at the end, the elements after the deleted position have to be moved; here that’s only 1 item, though. I also tested deleting at the front, and it’s just as quick: there Julia doesn’t shift any data but simply moves the start of the vector forward, which you can see as a higher address from:

pointer(a)
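For example (observed on 1.11; the exact offset behaviour is an implementation detail):

a = collect(1:60_000)
p = pointer(a)
deleteat!(a, 1)       # nothing needs to be copied; the start of the vector just moves
pointer(a) - p        # 8 here on 1.11: the data now begins one Int64 further into the buffer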

I’m looking into push!: it calls Base._growend!(a, 1), which in turn calls newmem = Base.array_new_memory(mem, newmemlen2), and that allocates twice.

If the position passed to deleteat! is not at the end, then you’ll have the overhead of copying all following data back by one position…
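A rough estimate of that copying cost (assuming Int64 elements and uniformly random positions, so each call moves about half the vector on average):

n = 60_000
(n ÷ 2) * sizeof(Int)   # 240000 bytes ≈ 234 KiB shifted per insert!/deleteat! on average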

I now think (confirmed*) that this is the actual bottleneck in my code: insert! and deleteat! have to shift data forward or backward by one position. This also fits the observation that sizehint! had little effect on runtime in my case.

Interestingly, the benchmarks below indicate that insert! and deleteat! are slower when the modified indices are near the end, and almost as fast as push! when the array is modified near the start (?).

using BenchmarkTools

@btime insert!(a, 59_999, 0) setup=(a=collect(1:60000)) evals=10000
# 414.436 ns (0 allocations: 106 bytes)

@btime push!(a, 0) setup=(a=collect(1:60000)) evals=10000
# 4.684 ns (0 allocations: 106 bytes)

@btime deleteat!(a, 49_999) setup=(a=collect(1:60000)) evals=10000
# 521.819 ns (0 allocations: 0 bytes)

@btime setindex!(a, 0, 59_999) setup=(a=collect(1:60000)) evals=10000
# 2.167 ns (0 allocations: 0 bytes)


@btime insert!(a, 2, 0) setup=(a=collect(1:60000)) evals=10000
# 12.900 ns (0 allocations: 106 bytes)

@btime push!(a, 0) setup=(a=collect(1:60000)) evals=10000
# 4.668 ns (0 allocations: 106 bytes)

@btime deleteat!(a, 2) setup=(a=collect(1:60000)) evals=10000
# 9.832 ns (0 allocations: 0 bytes)

@btime setindex!(a, 0, 2) setup=(a=collect(1:60000)) evals=10000
# 1.846 ns (0 allocations: 0 bytes)

*EDIT: I replaced insert! and deleteat! with push! and pop!, which is not what I want, but it keeps the overall nature of the code while removing only the extra copying work of insert! and deleteat!. This gave drastically faster run times for the overall simulation algorithm.

Yes, if the insert! is exactly at the beginning (or if you use pushfirst!), Julia will allocate and move everything to the right, but it will do so leaving some empty space before the start of the vector, so that further pushfirst!s are fast.
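A quick sketch of that (the allocation behaviour in the comments is what I’d expect, not a guarantee):

a = collect(1:60_000)
@time pushfirst!(a, 0)   # reallocates and moves everything, leaving slack before the start
@time pushfirst!(a, 0)   # should reuse that slack: no new allocation, much faster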

2 Likes
  • deleteat! at the last position is fast (it doesn’t need to move anything or allocate). The benchmark is no good because you’re not removing at the end.
    Try with: @btime deleteat!(a, length(a)) setup=(a=collect(1:60000)) evals=10000
  • insert! at the last position is also fast (it only allocates but doesn’t need to move anything). The benchmark is wrong because it is not inserting at the end after the first eval.
    Try with: @btime insert!(a, length(a)+1, 0) setup=(a=collect(1:60000)) evals=10000

Yes, thanks, then we have a complete view now! :smiley:

But I wouldn’t call my benchmark wrong; the choice of indices was on purpose. I meant “rather at the start” vs. “rather at the end” of the array (not precisely the first and last index). Because my simulation draws random indices, the very first and very last index are negligible for my overall performance.

This behaviour was just interesting to see for me.

2 Likes

The distinction that you’re maybe missing is that index 2 is very near the start for all 10000 evals, whereas index 59999 is near the end at the first eval but not so near at the last eval when there are 70000 elements in the vector. I.e. there is a difference in how much data needs to be moved around in the two benchmarks.

Since moving the data around is a bottleneck you should look for ways to reduce that. How important is it to keep the ordering of your elements? Are there other ways to handle that? What are the actual requirements on the vector?
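For instance, if the order of elements doesn’t matter for your simulation, a hypothetical swapremove! (not a Base function, just a sketch) deletes at any position without shifting any data, and an insert-anywhere becomes a plain push!:

function swapremove!(a::Vector, i::Integer)
    a[i] = a[end]   # overwrite the element to delete with the last one
    pop!(a)         # drop the (now duplicated) last element; nothing else moves
    return a
end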

2 Likes

I apologize; with “wrong” I just meant to say that they weren’t benchmarking what you wrote: “[…] the benchmarks below indicate that insert! and deleteat! are slower when the modified indices are near the end […]”. Sorry that it came out unnecessarily negative. I agree, it was an interesting behavior to see and verify. :smile: