Documenting performance model of `empty!`, `sizehint!`, `push!` & friends

proposal
#1

It is a common idiom to allocate a mutable container, push! or append! values into it (optionally after providing a sizehint!), then resize! or empty! it, and reuse it between runs.

This leads to performant code with little effort, especially when the number of elements collected cannot be known before each run but is roughly similar from run to run.

However, I learned most of this from discussions like this one, not from the documentation.

I propose that the following be documented:

  1. push! and append! methods may (but are not required to) preallocate extra storage. The types in Base do preallocate, using a growth heuristic tuned for the general use case.

  2. sizehint! may control this preallocation. It does this for types in Base.

  3. empty! is nearly costless (and O(1)) for types that support this kind of preallocation.

None of this is breaking, just documenting the status quo.

Furthermore, we could think about an API to query the current sizehint! setting. For example, imagine that one is collecting random elements, usually less than 10, but very infrequently around 10000. empty! would keep that large buffer around, while one could imagine that the caller could check for this and resize to something smaller.
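The idiom described above might look like the following sketch. Note that the shrink-back step at the end stands in for the hypothetical capacity-query API: there is no public way to inspect the current overallocation, so the caller can only shrink blindly.

```julia
# Sketch of the reuse idiom: one buffer, refilled between runs.
# empty! drops the elements but keeps the storage, so runs of
# similar size avoid reallocation entirely.
function collect_run!(buf, n)
    empty!(buf)           # O(1): length goes to 0, capacity is kept
    sizehint!(buf, n)     # hint the expected size of this run
    for i in 1:n
        push!(buf, i)
    end
    return buf
end

buf = Int[]
collect_run!(buf, 10)       # typical small run
collect_run!(buf, 10_000)   # rare large run: the buffer grows and stays large
# Lacking a public capacity query, shrinking back is blind, e.g.:
# (whether this actually releases memory is implementation behavior
# of Base arrays, not a documented guarantee)
empty!(buf); sizehint!(buf, 16)
```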

#2

This would be great to have. I would open an issue proposing exactly this, and then, once there’s agreement on what is guaranteed and what is merely the current behavior, turn that into actual docs.

#3

Great, I will do it.

#4

FWIW,

julia> function pushtest_hint!(n, v = Vector{Int}())
           empty!(v)
           sizehint!(v, n)
           for i in 1:n
               push!(v, i)
           end
           v
       end;

julia> function filltest!(n, v = Vector{Int}())
           resize!(v, n)
           for i in 1:n
               v[i] = i
           end
           v
       end;

julia> function pushtest_nohint!(n, v = Vector{Int}())
           empty!(v)
           for i in 1:n
               push!(v, i)
           end
           v
       end;

tested with

julia> N1 = 10_000; N2 = 10_000_000; N3 = 50_000_000;
       v1 = zeros(Int, N1); v2 = zeros(Int, N2); v3 = zeros(Int, N3);

julia> using BenchmarkTools

julia> begin
           run(`cat /sys/kernel/mm/transparent_hugepage/enabled`)
           for (N, v) in [(N1, v1), (N2, v2), (N3, v3)]
               @show N, :no_reuse
               @btime pushtest_nohint!($N)
               @btime pushtest_hint!($N)
               @btime filltest!($N)
               @show N, :reuse
               @btime pushtest_nohint!($N, $v)
               @btime pushtest_hint!($N, $v)
               @btime filltest!($N, $v)
           end
       end;

yields on my machine

[always] madvise never
(N, :no_reuse) = (10000, :no_reuse)
  79.359 μs (14 allocations: 256.64 KiB)
  62.778 μs (2 allocations: 78.27 KiB)
  8.214 μs (2 allocations: 78.27 KiB)
(N, :reuse) = (10000, :reuse)
  61.888 μs (0 allocations: 0 bytes)
  61.914 μs (0 allocations: 0 bytes)
  5.836 μs (0 allocations: 0 bytes)
(N, :no_reuse) = (10000000, :no_reuse)
  105.401 ms (24 allocations: 129.00 MiB)
  71.133 ms (2 allocations: 76.29 MiB)
  14.669 ms (2 allocations: 76.29 MiB)
(N, :reuse) = (10000000, :reuse)
  63.381 ms (0 allocations: 0 bytes)
  63.283 ms (0 allocations: 0 bytes)
  8.419 ms (0 allocations: 0 bytes)
(N, :no_reuse) = (50000000, :no_reuse)
  537.729 ms (26 allocations: 416.61 MiB)
  356.954 ms (2 allocations: 381.47 MiB)
  141.562 ms (2 allocations: 381.47 MiB)
(N, :reuse) = (50000000, :reuse)
  317.353 ms (0 allocations: 0 bytes)
  316.974 ms (0 allocations: 0 bytes)
  42.260 ms (0 allocations: 0 bytes)

and

always [madvise] never
(N, :no_reuse) = (10000, :no_reuse)
  82.518 μs (14 allocations: 256.64 KiB)
  62.817 μs (2 allocations: 78.27 KiB)
  8.026 μs (2 allocations: 78.27 KiB)
(N, :reuse) = (10000, :reuse)
  65.603 μs (0 allocations: 0 bytes)
  61.832 μs (0 allocations: 0 bytes)
  5.481 μs (0 allocations: 0 bytes)
(N, :no_reuse) = (10000000, :no_reuse)
  176.516 ms (24 allocations: 129.00 MiB)
  95.101 ms (2 allocations: 76.29 MiB)
  39.220 ms (2 allocations: 76.29 MiB)
(N, :reuse) = (10000000, :reuse)
  66.751 ms (0 allocations: 0 bytes)
  63.399 ms (0 allocations: 0 bytes)
  9.020 ms (0 allocations: 0 bytes)
(N, :no_reuse) = (50000000, :no_reuse)
  837.384 ms (26 allocations: 416.61 MiB)
  476.236 ms (2 allocations: 381.47 MiB)
  193.821 ms (2 allocations: 381.47 MiB)
(N, :reuse) = (50000000, :reuse)
  334.521 ms (0 allocations: 0 bytes)
  319.596 ms (0 allocations: 0 bytes)
  44.946 ms (0 allocations: 0 bytes)

Take-away: Tuning transparent huge pages (THP) in the Linux kernel is just as relevant as sizehint! (I learned that yesterday; many distros default to madvise, and defrag is another relevant tunable). The foreign-function-call overhead of push! matters more than sizehint!. If you really care about performance, don’t use push! at all; run your own overallocation scheme instead: resize! to something big, fill up while tracking the current index by hand, then resize! and sizehint! to shrink back afterwards. This is effectively what IOBuffer does.
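A minimal sketch of such a hand-rolled overallocation scheme (the initial capacity of 64 and the doubling policy are arbitrary choices for illustration, not what IOBuffer literally does):

```julia
# Allocate generously, fill via setindex! while tracking the index by
# hand, then trim back at the end. This avoids paying the push! ccall
# overhead on every element.
function collect_manual(n)
    v = Vector{Int}(undef, 64)         # arbitrary initial capacity
    len = 0
    for i in 1:n
        len += 1
        if len > length(v)
            resize!(v, 2 * length(v))  # double when full
        end
        v[len] = i
    end
    resize!(v, len)       # trim the logical length back to what was used
    sizehint!(v, len)     # ask Base to release the slack (may shrink)
    return v
end
```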

#5

Isn’t this

This is surely related, but orthogonal to the ideal behavior of push! and friends once the above is fixed.

#6

issue opened:

#7

It is exactly this issue. Unfortunately I don’t see this getting fixed soon: my efforts have failed so far, as far as I know no one else wants to work on it, and none of the professional orgs prioritize it (rightfully, in my opinion: there are more important things going on).

As the situation currently stands, I don’t think emphasizing the “ideal” performance model with sizehint! is super helpful, since it is too far from current reality. Reuse, resize! (IOBuffer-style, i.e. kristoffer.carlsson’s joking pushvector suggestion), and even OS tuning look more relevant to me, and the docs should not implicitly suggest that sizehint! addresses the root causes of bad push! performance.

On the other hand, I’d agree very much with properly documenting memory overhead (e.g. empty! does not release memory, sizehint! does). Slowness determines caffeine consumption when waiting for results; memory consumption determines the problem sizes people can handle at all without buying new hardware.

But also for memory consumption, the resizing strategy (currently: doubling) is probably not more important than OS tuning (no one cares how much virtual address space we overallocate; the relevant measure is how many of those pages are resident).
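To illustrate the memory-overhead point above: empty! keeps the buffer, so refilling a vector that has already grown typically performs zero allocations. A small sketch:

```julia
# empty! does not release memory: after it, the old capacity is still
# there, so a subsequent refill of the same size reuses the storage.
function refill!(v, n)
    empty!(v)             # keeps the existing buffer
    for i in 1:n
        push!(v, i)
    end
    return v
end

v = collect(1:10_000)
refill!(v, 10_000)              # warm-up (compilation)
@allocated refill!(v, 10_000)   # typically 0 bytes: storage was reused
```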

#8

FWIW, until the above issue is fixed, I packaged up Kristoffer Carlsson’s workaround and registered it as

#9

With:

note that you should not use the original after that

do you mean: “note that you should not push! to v after that”?

#10

That’s what the docstring of finish! says (consequences are undefined if you finish! and then push!).

But in fact it may be innocuous. I have to think about it.

#11

Could you expand on this? It sounds kind of magical; I’m assuming you do not mean that the compiler somehow sniffs out push! and append! and stack-allocates how much it thinks it will need?

#12

No, but push! ends up calling _growend!, which in turn ccalls into jl_array_grow_at_end, which checks some bookkeeping and preallocates if necessary. You can just follow the chain down from @edit push!(...), which leads directly into array.c in src :)

#13

Of course not. What happens is that push! allocates more space than needed at the end of the array, so the next time you push! it may not need to reallocate.

It is because of this kind of confusion that I think this should be documented.
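One way to observe this overallocation is to count how many of a run of push! calls actually allocate. A sketch (the exact growth points are implementation details of Base, not documented behavior):

```julia
# Count how many of n consecutive push! calls trigger a reallocation.
# Because growth is geometric, only a handful do; the rest land in the
# slack capacity left behind by the previous growth.
function count_growths(n)
    v = Int[]
    growths = 0
    for i in 1:n
        if @allocated(push!(v, i)) > 0
            growths += 1
        end
    end
    return growths
end

count_growths(1000)   # far fewer than 1000 (exact count is implementation-defined)
```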

#14

The docstring says you can’t modify v, which to me sounds like not using setindex! either (although it seems like that should be fine, of course).

#15

I haven’t decided how seriously I want to take this package, given that it is a workaround, but if someone opens an issue or a PR about a feature, I will consider it.