Improving my mental model: Why is `sizehint!()` so much slower than `zeros()`?

maxkapur · March 31, 2022, 1:47am

I am asking this question in an effort to improve my understanding of computer science, not to complain about Julia’s performance.

I understand that, in principle, if we know that we are going to be filling a vector with exactly 1000 entries, it is better to simply declare it as a 1000-element vector at the outset than to start with [] and push.

I also understand that if we know that our vector will have approximately/at least 1000 entries, but aren’t certain, then we can improve the performance of the push operations by using sizehint!(). In practice, it seems like this is mainly used for dictionaries, heaps, and other higher-level data structures rather than vectors of numbers.

My mental model of how sizehint!() works is as follows:

If I call a = zeros(Float64, 1000), Julia allocates 1000 blocks of memory, fills them with zeros, and then “labels” them with the variable name a and a header that says “a is a 1000-element vector whose entries are stored over here.”
If I call a = Float64[]; sizehint!(a, 1000), Julia allocates 1000 blocks of memory, builds a “fence” around them that says that other variables shouldn’t have their values stored here, and then “labels” this with the variable name a and a header that says “a is a 0-element vector whose entries are stored over here.”

Now if I call push!(a, 0.5) 1000 times, the memory needed for the push is already “right next door,” so without iterating over the whole of a, Julia can simply increment its size and write the value of 0.5 to the end block in O(1) time.
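The observable part of this model can be checked at the REPL. A minimal sketch (the variable names `len_before` and `len_after` are only illustrative): after `sizehint!`, the vector still reports length 0, and each `push!` grows the length by one. How much capacity is actually reserved is not visible through `length`.

```julia
a = Float64[]
sizehint!(a, 1000)       # requests capacity; the length stays unchanged
len_before = length(a)   # number of elements, not reserved capacity

for _ in 1:1000
    push!(a, 0.5)        # appends one element and bumps the length
end
len_after = length(a)
```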

If my mental model is correct, then the function bar() below should be faster than the function foo(), because while both functions allocate m Float64-sized blocks, foo() needlessly fills them with zeros before overwriting them with 0.5, whereas bar() does not. The only additional computation required by bar() (in my mental model) is incrementing the size of a by 1 at each i, but this cannot be substantially more expensive than writing a bunch of zeros to the memory.

julia> function foo(m)
           a = zeros(Float64, m)
           for i in 1:m
               a[i] = 0.5
           end
       end
foo (generic function with 1 method)

julia> function bar(m)
           a = Float64[]
           sizehint!(a, m)
           for i in 1:m
               push!(a, 0.5)
           end
       end
bar (generic function with 1 method)

But the opposite is the case: bar() is nearly 4.5 times slower (comparing minimum times).

julia> using BenchmarkTools

julia> @benchmark foo(50)
BenchmarkTools.Trial: 10000 samples with 972 evaluations.
 Range (min … max):  66.278 ns … 567.468 ns  ┊ GC (min … max): 0.00% … 84.62%
 Time  (median):     83.435 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   90.296 ns ±  45.138 ns  ┊ GC (mean ± σ):  4.83% ±  8.40%

   ▃█▄▂▁▁                                                      ▁
  ▅████████▇▇▆▆▄▃▁▁▁▃▄█▇▅▅▅▄▅▄▁▃▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ █
  66.3 ns       Histogram: log(frequency) by time       471 ns <

 Memory estimate: 496 bytes, allocs estimate: 1.

julia> @benchmark bar(50)
BenchmarkTools.Trial: 10000 samples with 264 evaluations.
 Range (min … max):  294.227 ns …   5.683 μs  ┊ GC (min … max): 0.00% … 94.45%
 Time  (median):     300.496 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   325.789 ns ± 255.622 ns  ┊ GC (mean ± σ):  4.16% ±  5.02%

  ▃█▇▃▃                                   ▁▁                    ▁
  █████▇▅▆▅▂▄▆███▆▇▆▅▅▄▄▆▄▆▇▇▅▆▇▅▅▄▅▄▅▆▅▆▇██▇▅▄▅▇██▆▅▅▄▅▄▅▃▃▃▃▄ █
  294 ns        Histogram: log(frequency) by time        495 ns <

 Memory estimate: 512 bytes, allocs estimate: 2.

What’s going on here?

(The results are similar if you write rand() instead of 0.5.)

gbaraldi · March 31, 2022, 1:59am

Unfortunately it isn’t as simple as that, for a couple of reasons.
Zeroing the whole array isn’t that slow: it becomes a single memset call after the allocation. In bar, by contrast, you first allocate an empty array and then extend it.
The bigger issue however is the loop. Indexing and setting a value is very quick, and in that case it can also utilize SIMD, which allows the CPU to set multiple values with a single instruction.
Push can’t do that, because on every iteration it has to check if it fits inside the container, while also just being slower than normal indexing.

Elrod · March 31, 2022, 2:00am

push! calls into C code, so it is not inlined and is invisible to the optimizer.
So instead of being compiled into a few SIMD instructions that perform many loop iterations at a time, you get calls into C code.

See:
https://github.com/JuliaLang/julia/issues/24909
https://github.com/tpapp/PushVectors.jl
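To make the per-push capacity check concrete, here is a rough sketch of a push-style vector written entirely in Julia. It is not the code of Base or of PushVectors.jl; the type `SketchPushVector` and the functions `sketch_push!`, `sketch_finish!` and `sketch_buz` are hypothetical names invented for this illustration, and the doubling growth policy is an assumption.

```julia
# Hypothetical push-style vector in plain Julia (illustration only;
# this is not the implementation used by Base or by PushVectors.jl).
mutable struct SketchPushVector{T}
    data::Vector{T}  # backing storage, usually longer than `len`
    len::Int         # number of elements pushed so far
end

# Start with some uninitialized capacity and zero pushed elements.
SketchPushVector{T}(capacity::Integer = 4) where {T} =
    SketchPushVector{T}(Vector{T}(undef, capacity), 0)

function sketch_push!(v::SketchPushVector{T}, x) where {T}
    # Capacity check: every push has to decide whether the element fits.
    if v.len == length(v.data)
        resize!(v.data, max(4, 2 * length(v.data)))  # assumed growth policy
    end
    v.len += 1
    @inbounds v.data[v.len] = convert(T, x)  # safe: len <= length(data) here
    return v
end

# Trim the unused capacity and return an ordinary Vector.
sketch_finish!(v::SketchPushVector) = resize!(v.data, v.len)

function sketch_buz(m)
    v = SketchPushVector{Float64}(m)  # capacity reserved up front
    for _ in 1:m
        sketch_push!(v, 0.5)
    end
    return sketch_finish!(v)
end
```

Since this is ordinary Julia code, the compiler at least has the chance to inline `sketch_push!` into the loop. No timing claims are made for this sketch.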

Elrod · March 31, 2022, 2:01am

If it were implemented in Julia, the optimizer should be able to compile that away.

PushVector does quite well

julia> using PushVectors

julia> function buz(m)
           a = PushVector{Float64}()
           sizehint!(a, m)
           for i in 1:m
               push!(a, 0.5)
           end
       end
buz (generic function with 1 method)

julia> @benchmark foo(1000)
BenchmarkTools.Trial: 10000 samples with 40 evaluations.
 Range (min … max):  828.600 ns … 81.826 μs  ┊ GC (min … max):  0.00% … 97.52%
 Time  (median):     950.075 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.427 μs ±  2.356 μs  ┊ GC (mean ± σ):  10.48% ±  7.32%

   ▃█▆▃                                                 ▂▄▃▃▂  ▁
  ▆████▇▆▅▁▃▄▃▁▄▃▁▁▄▄▁▄▃▁▃▄▁▁▁▄▃▁▄▁▃▃▄▄▃▁▃▄▁▁▄▄▃▄▃▁▄▄▃▁▅██████ █
  829 ns        Histogram: log(frequency) by time      3.43 μs <

 Memory estimate: 7.94 KiB, allocs estimate: 1.

julia> @benchmark bar(1000)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.838 μs … 149.373 μs  ┊ GC (min … max): 0.00% … 94.18%
 Time  (median):     4.934 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.093 μs ±   3.397 μs  ┊ GC (mean ± σ):  2.22% ±  3.20%

       ▅▇▆█▆▃
  ▁▁▂▄███████▇▆▅▄▃▂▂▁▁▁▁▁▁▁▁▁▁▂▃▃▃▃▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁ ▂
  4.84 μs         Histogram: frequency by time        5.46 μs <

 Memory estimate: 7.94 KiB, allocs estimate: 2.

julia> @benchmark buz(1000)
BenchmarkTools.Trial: 10000 samples with 44 evaluations.
 Range (min … max):  952.136 ns … 24.023 μs  ┊ GC (min … max): 0.00% … 92.98%
 Time  (median):       1.004 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.103 μs ±  1.109 μs  ┊ GC (mean ± σ):  8.37% ±  7.73%

                ▃▆█▇▆▃▁
  ▁▂▃▂▂▂▂▁▁▁▂▄▅▇███████▇▅▄▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  952 ns          Histogram: frequency by time         1.13 μs <

 Memory estimate: 7.97 KiB, allocs estimate: 2.

However, it is still slightly slower than foo, and it is not using SIMD instructions.

maxkapur · March 31, 2022, 2:04am

(Again, I am asking an annoying question in an effort to improve my mental model:) Doesn’t array indexing have to perform a check on every iteration too?

julia> b = zeros(Float64, 10)
10-element Vector{Float64}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

julia> b[3] = "a string"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Float64
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at ~/julia-1.8.0-beta1/share/julia/base/number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at ~/julia-1.8.0-beta1/share/julia/base/number.jl:7
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at ~/julia-1.8.0-beta1/share/julia/base/twiceprecision.jl:273
  ...
Stacktrace:
 [1] setindex!(A::Vector{Float64}, x::String, i1::Int64)
   @ Base ./array.jl:966
 [2] top-level scope
   @ REPL[15]:1

gbaraldi · March 31, 2022, 2:05am

Ideally the first case wouldn’t need to call memset for the zeros either, whether by using calloc or by having the fill optimized away. Even better if it could call memset for 0.5. I’m not sure, however, how fast memset is compared to a normal SIMDed loop.
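For reference, Base already lets you express "a vector of 0.5s" without writing a separate zeroing pass yourself. A small sketch using `fill` and `fill!`; whether these lower to memset or to a SIMD loop is not something this thread establishes.

```julia
m = 1000
a = fill(0.5, m)                # allocate and fill in one call

b = Vector{Float64}(undef, m)   # uninitialized storage
fill!(b, 0.5)                   # fill an existing vector in place
```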

gbaraldi · March 31, 2022, 2:06am

In lots of cases the compiler can prove that all indices are inbounds so it optimizes away the check.
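As a sketch of what that looks like in practice (the function names `fill_half!` and `fill_half_inbounds!` are made up for this example): iterating over `eachindex(a)` ties the loop range to the array itself, and `@inbounds` is the explicit way to promise that the indices are valid. Whether the check is really gone on a given Julia version can be inspected with `@code_llvm`.

```julia
# Loop range comes from the array, so every index is valid by construction.
function fill_half!(a::Vector{Float64})
    for i in eachindex(a)
        a[i] = 0.5
    end
    return a
end

# Same loop with an explicit promise that indices are in bounds.
# Only correct because 1:length(a) is a valid index range for a Vector.
function fill_half_inbounds!(a::Vector{Float64})
    @inbounds for i in 1:length(a)
        a[i] = 0.5
    end
    return a
end
```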

Elrod · March 31, 2022, 2:07am

memset is faster at large sizes.
The implementation checks the size of the arrays, and will use non-temporal stores if it’s large enough to flush your cache anyway, for example.

maxkapur · March 31, 2022, 3:48am

So in summary: My mental model is not fundamentally wrong, but I didn’t know about Julia’s push!() implementation relying on calls to C under the hood, and this interaction with C code explains the performance difference above.

goerch · March 31, 2022, 9:03am

I’m still missing the obligatory reference to Vector{T}(undef, N) in this thread.
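For completeness, a sketch of that variant (the name `qux` is made up here, following the foo/bar/buz pattern): it allocates the vector without initializing it and then writes every entry by index, so it skips both the zeroing pass of foo and the push! calls of bar. No benchmark numbers are claimed for it.

```julia
function qux(m)
    a = Vector{Float64}(undef, m)  # allocated, contents are arbitrary
    for i in 1:m
        a[i] = 0.5                 # every entry is written before any read
    end
    return a
end
```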
