Documenting performance model of `empty!`, `sizehint!`, `push!` & friends

proposal
#1

It is a common idiom to allocate a mutable container, push! or append! values into it (optionally after providing a sizehint!), then resize! or empty! it, and reuse it between runs.

This leads to performant code with little effort, especially when the number of elements collected cannot be known before each run but is roughly similar from run to run.

However, I learned most of this from discussions like this one, not from the documentation.

I propose that the following be documented:

  1. push! and append! methods may (but are not required to) preallocate extra storage. The types in Base do preallocate, using a growth heuristic tuned for the general use case.

  2. sizehint! may control this preallocation. It does this for types in Base.

  3. empty! is nearly costless (and O(1)) for types that support this kind of preallocation.

None of this is breaking, just documenting the status quo.

Furthermore, we could think about an API to query the current sizehint! setting. For example, imagine that one is collecting random elements, usually less than 10, but very infrequently around 10000. empty! would keep that large buffer around, while one could imagine that the caller could check for this and resize to something smaller.
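The idiom described above might look like the following sketch. Note that the shrink-back step at the end stands in for the hypothetical capacity-query API: there is no public way to inspect the current overallocation, so the caller can only shrink blindly.

```julia
# Sketch of the reuse idiom: one buffer, refilled between runs.
# empty! drops the elements but keeps the storage, so runs of
# similar size avoid reallocation entirely.
function collect_run!(buf, n)
    empty!(buf)           # O(1): length goes to 0, capacity is kept
    sizehint!(buf, n)     # hint the expected size of this run
    for i in 1:n
        push!(buf, i)
    end
    return buf
end

buf = Int[]
collect_run!(buf, 10)       # typical small run
collect_run!(buf, 10_000)   # rare large run: the buffer grows and stays large
# Lacking a public capacity query, shrinking back is blind, e.g.:
# (whether this actually releases memory is implementation behavior
# of Base arrays, not a documented guarantee)
empty!(buf); sizehint!(buf, 16)
```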

#2

This would be great to have. I would open an issue proposing exactly this, and then, once there’s agreement on what is guaranteed and what is merely the current behavior, turn that into actual docs.

#3

Great, I will do it.

#4

FWIW,

julia> function pushtest_hint!(n, v = Vector{Int}())
           empty!(v)
           sizehint!(v, n)
           for i in 1:n
               push!(v, i)
           end
           v
       end;

julia> function filltest!(n, v = Vector{Int}())
           resize!(v, n)
           for i in 1:n
               v[i] = i
           end
           v
       end;

julia> function pushtest_nohint!(n, v = Vector{Int}())
           empty!(v)
           for i in 1:n
               push!(v, i)
           end
           v
       end;

tested with

julia> N1 = 10_000; N2 = 10_000_000; N3 = 50_000_000;
       v1 = zeros(Int, N1); v2 = zeros(Int, N2); v3 = zeros(Int, N3);

julia> using BenchmarkTools

julia> begin
           run(`cat /sys/kernel/mm/transparent_hugepage/enabled`)
           for (N, v) in [(N1, v1), (N2, v2), (N3, v3)]
               @show N, :no_reuse
               @btime pushtest_nohint!($N)
               @btime pushtest_hint!($N)
               @btime filltest!($N)
               @show N, :reuse
               @btime pushtest_nohint!($N, $v)
               @btime pushtest_hint!($N, $v)
               @btime filltest!($N, $v)
           end
       end;

yields on my machine

[always] madvise never
(N, :no_reuse) = (10000, :no_reuse)
  79.359 μs (14 allocations: 256.64 KiB)
  62.778 μs (2 allocations: 78.27 KiB)
  8.214 μs (2 allocations: 78.27 KiB)
(N, :reuse) = (10000, :reuse)
  61.888 μs (0 allocations: 0 bytes)
  61.914 μs (0 allocations: 0 bytes)
  5.836 μs (0 allocations: 0 bytes)
(N, :no_reuse) = (10000000, :no_reuse)
  105.401 ms (24 allocations: 129.00 MiB)
  71.133 ms (2 allocations: 76.29 MiB)
  14.669 ms (2 allocations: 76.29 MiB)
(N, :reuse) = (10000000, :reuse)
  63.381 ms (0 allocations: 0 bytes)
  63.283 ms (0 allocations: 0 bytes)
  8.419 ms (0 allocations: 0 bytes)
(N, :no_reuse) = (50000000, :no_reuse)
  537.729 ms (26 allocations: 416.61 MiB)
  356.954 ms (2 allocations: 381.47 MiB)
  141.562 ms (2 allocations: 381.47 MiB)
(N, :reuse) = (50000000, :reuse)
  317.353 ms (0 allocations: 0 bytes)
  316.974 ms (0 allocations: 0 bytes)
  42.260 ms (0 allocations: 0 bytes)

and

always [madvise] never
(N, :no_reuse) = (10000, :no_reuse)
  82.518 μs (14 allocations: 256.64 KiB)
  62.817 μs (2 allocations: 78.27 KiB)
  8.026 μs (2 allocations: 78.27 KiB)
(N, :reuse) = (10000, :reuse)
  65.603 μs (0 allocations: 0 bytes)
  61.832 μs (0 allocations: 0 bytes)
  5.481 μs (0 allocations: 0 bytes)
(N, :no_reuse) = (10000000, :no_reuse)
  176.516 ms (24 allocations: 129.00 MiB)
  95.101 ms (2 allocations: 76.29 MiB)
  39.220 ms (2 allocations: 76.29 MiB)
(N, :reuse) = (10000000, :reuse)
  66.751 ms (0 allocations: 0 bytes)
  63.399 ms (0 allocations: 0 bytes)
  9.020 ms (0 allocations: 0 bytes)
(N, :no_reuse) = (50000000, :no_reuse)
  837.384 ms (26 allocations: 416.61 MiB)
  476.236 ms (2 allocations: 381.47 MiB)
  193.821 ms (2 allocations: 381.47 MiB)
(N, :reuse) = (50000000, :reuse)
  334.521 ms (0 allocations: 0 bytes)
  319.596 ms (0 allocations: 0 bytes)
  44.946 ms (0 allocations: 0 bytes)

Take-away: Tuning transparent huge pages (THP) in the Linux kernel is just as relevant as sizehint! (I learned that yesterday; many distros default to madvise, and defrag is another relevant tunable). The foreign-function-call overhead of push! matters more than sizehint!. If you really care about performance, don’t use push! at all; run your own overallocation scheme instead: resize! to something big, fill up while tracking the current index by hand, then resize! and sizehint! to shrink back afterwards. This is effectively what IOBuffer does.
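A minimal sketch of such a hand-rolled overallocation scheme (the initial capacity of 64 and the doubling policy are arbitrary choices for illustration, not what IOBuffer literally does):

```julia
# Allocate generously, fill via setindex! while tracking the index by
# hand, then trim back at the end. This avoids paying the push! ccall
# overhead on every element.
function collect_manual(n)
    v = Vector{Int}(undef, 64)         # arbitrary initial capacity
    len = 0
    for i in 1:n
        len += 1
        if len > length(v)
            resize!(v, 2 * length(v))  # double when full
        end
        v[len] = i
    end
    resize!(v, len)       # trim the logical length back to what was used
    sizehint!(v, len)     # ask Base to release the slack (may shrink)
    return v
end
```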

#5

Isn’t this

This is surely related, but orthogonal to the ideal behavior of push! and friends once the above is fixed.

#6

issue opened:

#7

It is exactly this issue. Unfortunately I don’t see this getting fixed soon: my efforts have failed so far, as far as I know no one else wants to work on it, and none of the professional orgs prioritize it (rightfully, in my opinion: there are more important things going on).

As the situation currently stands, I don’t think emphasizing the “ideal” performance model with sizehint! is super helpful, since it is too far from current reality. Reuse, resize! (IOBuffer-style, i.e. kristoffer.carlsson’s joking pushvector suggestion), and even OS tuning look more relevant to me, and the docs should not implicitly suggest that sizehint! addresses the root causes of bad push! performance.

On the other hand, I’d agree very much with properly documenting memory overhead (e.g. empty! does not release memory, sizehint! does). Slowness determines caffeine consumption when waiting for results; memory consumption determines the problem sizes people can handle at all without buying new hardware.

But also for memory consumption, the resizing strategy (currently: doubling) is probably not more important than OS tuning (no one cares how much virtual address space we overallocate; the relevant measure is how many of those pages are resident).
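To illustrate the memory-overhead point above: empty! keeps the buffer, so refilling a vector that has already grown typically performs zero allocations. A small sketch:

```julia
# empty! does not release memory: after it, the old capacity is still
# there, so a subsequent refill of the same size reuses the storage.
function refill!(v, n)
    empty!(v)             # keeps the existing buffer
    for i in 1:n
        push!(v, i)
    end
    return v
end

v = collect(1:10_000)
refill!(v, 10_000)              # warm-up (compilation)
@allocated refill!(v, 10_000)   # typically 0 bytes: storage was reused
```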

#8

FWIW, until the above issue is fixed, I packaged up Kristoffer Carlsson’s workaround and registered it as

#9

With:

note that you should not use the original after that

do you mean: “note that you should not push! to v after that”?

#10

That’s what the docstring of finish! says (consequences are undefined if you finish! and then push!).

But in fact it may be innocuous. I have to think about it.

#11

Could you expand on this? It sounds kind of magical; I’m assuming you do not mean that the compiler somehow sniffs out push! and append! and stack-allocates how much it thinks it will need?

#12

No, but push! ends up calling _growend!, which in turn ccalls into jl_array_grow_at_end, which checks some bookkeeping and preallocates if necessary. You can just follow the chain down from @edit push!(...), which leads directly into array.c in src :)

#13

Of course not. What happens is that push! allocates more space than needed at the end of the array, so the next time you push! it may not need to reallocate.

It is because of this kind of confusion that I think this should be documented.
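One way to observe this overallocation is to count how many of a run of push! calls actually allocate. A sketch (the exact growth points are implementation details of Base, not documented behavior):

```julia
# Count how many of n consecutive push! calls trigger a reallocation.
# Because growth is geometric, only a handful do; the rest land in the
# slack capacity left behind by the previous growth.
function count_growths(n)
    v = Int[]
    growths = 0
    for i in 1:n
        if @allocated(push!(v, i)) > 0
            growths += 1
        end
    end
    return growths
end

count_growths(1000)   # far fewer than 1000 (exact count is implementation-defined)
```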

#14

The docstring says you can’t modify v, which to me sounds like not using setindex! either (although it seems like that should be fine, of course).

#15

I haven’t decided how seriously I want to take this package, given that it is a workaround, but if someone opens an issue or a PR about a feature, I will consider it.