Improving my mental model: Why is `sizehint!()` so much slower than `zeros()`?

I am asking this question in an effort to improve my understanding of computer science, not to complain about Julia’s performance :wink:

I understand that, in principle, if we know that we are going to be filling an vector with exactly 1000 entries, it is better to simply declare it as an a 1000-element vector at the outset than to start with [] and push.

I also understand that if we know that our vector will have approximately/at least 1000 entries, but aren’t certain, then we can improve the performance of the push operations by using sizehint!(). In practice, it seems like this is mainly used for dictionaries, heaps, and other higher-level data structures rather than vectors of numbers.

My mental model of how sizehint!() works is as follows:

  • If I call a = zeros(Float64, 1000), Julia allocates 1000 blocks of memory, fills them with zeros, and then β€œlabels” them with the variable name a and a header that says "a is a 1000-element vector whose entries are stored over here."

  • If I call a = Float64[]; sizehint!(a, 1000), Julia allocates 1000 blocks of memory, builds a β€œfence” around them that says that other variables shouldn’t have their values stored here, and then β€œlabels” this with the variable name a and a header that says "a is a 0-element vector whose entries are stored over here."

    Now if I call push!(a, 0.0) 1000 times, the memory needed for the push is already β€œright next door,” so without iterating over the whole of a, Julia can simply increment its size and write the value of 0.5 to the end block in O(1)-time.

If my mental model is correct, then the function bar() below should be faster than the function foo(), because while both functions allocate m Float64-sized blocks, the first one needlessly fills them with zeros before overwriting them to be 0.5, whereas the second does not. The only additional computation required by bar() (in my mental model) is incrementing the size of a by 1 at each i, but this cannot be substantially more expensive than writing a bunch of zeros to the memory.

julia> function foo(m)
           a = zeros(Float64, m)
           for i in 1:m
               a[i] = 0.5
           end
       end
foo (generic function with 1 method)

julia> function bar(m)
           a = Float64[]
           sizehint!(a, m)
           for i in 1:m
               push!(a, 0.5)
           end
       end
bar (generic function with 1 method)

But the opposite is the case: bar() is nearly 4.5 times faster.

julia> using BenchmarkTools

julia> @benchmark foo(50)
BenchmarkTools.Trial: 10000 samples with 972 evaluations.
 Range (min … max):  66.278 ns … 567.468 ns  β”Š GC (min … max): 0.00% … 84.62%
 Time  (median):     83.435 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   90.296 ns Β±  45.138 ns  β”Š GC (mean Β± Οƒ):  4.83% Β±  8.40%

   β–ƒβ–ˆβ–„β–‚β–β–                                                      ▁
  β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–„β–ƒβ–β–β–β–ƒβ–„β–ˆβ–‡β–…β–…β–…β–„β–…β–„β–β–ƒβ–ƒβ–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–… β–ˆ
  66.3 ns       Histogram: log(frequency) by time       471 ns <

 Memory estimate: 496 bytes, allocs estimate: 1.

julia> @benchmark bar(50)
BenchmarkTools.Trial: 10000 samples with 264 evaluations.
 Range (min … max):  294.227 ns …   5.683 ΞΌs  β”Š GC (min … max): 0.00% … 94.45%
 Time  (median):     300.496 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   325.789 ns Β± 255.622 ns  β”Š GC (mean Β± Οƒ):  4.16% Β±  5.02%

  β–ƒβ–ˆβ–‡β–ƒβ–ƒ                                   ▁▁                    ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–†β–…β–‚β–„β–†β–ˆβ–ˆβ–ˆβ–†β–‡β–†β–…β–…β–„β–„β–†β–„β–†β–‡β–‡β–…β–†β–‡β–…β–…β–„β–…β–„β–…β–†β–…β–†β–‡β–ˆβ–ˆβ–‡β–…β–„β–…β–‡β–ˆβ–ˆβ–†β–…β–…β–„β–…β–„β–…β–ƒβ–ƒβ–ƒβ–ƒβ–„ β–ˆ
  294 ns        Histogram: log(frequency) by time        495 ns <

 Memory estimate: 512 bytes, allocs estimate: 2.

What’s going on here?

(The results are similar if you write rand() instead of 0.5.)

1 Like

Unfortunately it isn’t as simple as that, for a couple of reasons.
Zeroing the whole array isn’t that slow, it becomes a single memset call after the allocation. While in bar you first allocate an empty array and then extend it.
The bigger issue however is the loop. Indexing and setting a value is very quick, and in that case it can also utilize SIMD, which allows the CPU to set multiple values with a single instruction.
Push can’t do that, because on every iteration it has to check if it fits inside the container, while also just being slower than normal indexing.

1 Like

push! calls into C code and is thus not inlined and invisible to the optimizer.
So instead of being compiled into few SIMD instructions that perform many loop iterations at a time, you get calls into C code.

See:

4 Likes

If it were implemented in Julia, the optimizer should be able to compile that away.

PushVector does quite well

julia> function buz(m)
           a = PushVector{Float64}()
           sizehint!(a, m)
           for i in 1:m
               push!(a, 0.5)
           end
       end
buz (generic function with 1 method)

julia> @benchmark foo(1000)
BenchmarkTools.Trial: 10000 samples with 40 evaluations.
 Range (min … max):  828.600 ns … 81.826 ΞΌs  β”Š GC (min … max):  0.00% … 97.52%
 Time  (median):     950.075 ns              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):     1.427 ΞΌs Β±  2.356 ΞΌs  β”Š GC (mean Β± Οƒ):  10.48% Β±  7.32%

   β–ƒβ–ˆβ–†β–ƒ                                                 β–‚β–„β–ƒβ–ƒβ–‚  ▁
  β–†β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–β–ƒβ–„β–ƒβ–β–„β–ƒβ–β–β–„β–„β–β–„β–ƒβ–β–ƒβ–„β–β–β–β–„β–ƒβ–β–„β–β–ƒβ–ƒβ–„β–„β–ƒβ–β–ƒβ–„β–β–β–„β–„β–ƒβ–„β–ƒβ–β–„β–„β–ƒβ–β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β–ˆ
  829 ns        Histogram: log(frequency) by time      3.43 ΞΌs <

 Memory estimate: 7.94 KiB, allocs estimate: 1.

julia> @benchmark bar(1000)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.838 ΞΌs … 149.373 ΞΌs  β”Š GC (min … max): 0.00% … 94.18%
 Time  (median):     4.934 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   5.093 ΞΌs Β±   3.397 ΞΌs  β”Š GC (mean Β± Οƒ):  2.22% Β±  3.20%

       β–…β–‡β–†β–ˆβ–†β–ƒ
  β–β–β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–„β–ƒβ–‚β–‚β–β–β–β–β–β–β–β–β–β–β–‚β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β– β–‚
  4.84 ΞΌs         Histogram: frequency by time        5.46 ΞΌs <

 Memory estimate: 7.94 KiB, allocs estimate: 2.

julia> @benchmark buz(1000)
BenchmarkTools.Trial: 10000 samples with 44 evaluations.
 Range (min … max):  952.136 ns … 24.023 ΞΌs  β”Š GC (min … max): 0.00% … 92.98%
 Time  (median):       1.004 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.103 ΞΌs Β±  1.109 ΞΌs  β”Š GC (mean Β± Οƒ):  8.37% Β±  7.73%

                β–ƒβ–†β–ˆβ–‡β–†β–ƒβ–
  β–β–‚β–ƒβ–‚β–‚β–‚β–‚β–β–β–β–‚β–„β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  952 ns          Histogram: frequency by time         1.13 ΞΌs <

 Memory estimate: 7.97 KiB, allocs estimate: 2.

However, it is slower, and it is not using SIMD instructions.

2 Likes

(Again, I am asking an annoying question in an effort to improve my mental model:) Doesn’t array indexing have to do the same?

julia> b = zeros(Float64, 10)
10-element Vector{Float64}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

julia> b[3] = "a string"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Float64
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at ~/julia-1.8.0-beta1/share/julia/base/number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at ~/julia-1.8.0-beta1/share/julia/base/number.jl:7
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at ~/julia-1.8.0-beta1/share/julia/base/twiceprecision.jl:273
  ...
Stacktrace:
 [1] setindex!(A::Vector{Float64}, x::String, i1::Int64)
   @ Base ./array.jl:966
 [2] top-level scope
   @ REPL[15]:1

Ideally in the first case it didn’t need to call memset for zeros either, through calloc or just optimize it away. Even better if it could call memset for 0.5. Not sure however how fast is memset compared to a normal SIMDed loop.

In lots of cases the compiler can prove that all indices are inbounds so it optimizes away the check.

memset is faster at large sizes.
The implementation checks the size of the arrays, and will use non-temporal stores if it’s large enough to flush your cash anyway, for example.

2 Likes

So in summary: My mental model is not fundamentally wrong, but I didn’t know about Julia’s push!() implementation relying on calls to C under the hood, and this interaction with C code explains the performance difference above.

I’m yet missing the obligatory reference to Vector{T}(undef, N) in this thread?

2 Likes