Unexpected garbage collection when using append!

I’m working with lots of vectors of variable length and contents, so I use the push! and append! methods frequently in my code. Using @benchmark from BenchmarkTools, I’ve identified the garbage collection associated with these methods as a significant bottleneck: @benchmark shows that, on average, about 80% of my total runtime is spent on garbage collection (mean % GC), and @profview from Profile identifies these methods as the main culprits.

As an MWE, here is a motif I use a lot in my code: using append! to combine lots of smaller vectors into one larger vector:

function motif(v::Vector{Vector{Float64}})
    out = Float64[]
    for vect in v
        append!(out, vect)
    end
    out
end

then benchmarking gives:

using BenchmarkTools, Random

vects = [rand(10) for i = 1:10]

julia> @benchmark motif($vects)
BenchmarkTools.Trial: 1102 samples with 719 evaluations.
 Range (min … max):  172.045 ns … 709.785 μs  ┊ GC (min … max):  0.00% … 99.96%
 Time  (median):     515.647 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):     6.853 μs ±  66.829 μs  ┊ GC (mean ± σ):  93.28% ±  9.48%

  ▇▃                 ▂█▇▃▂▁▁   ▁
  ███▇▅▁▅▁▄▅▁▄▁▄▄▁▄▁▆█████████████▆▅▆▆▇▇▇▅▅▄▆▅▅▅▆▄▅▅▁▅▁▁▆▁▄▁▁▁▄ █
  172 ns        Histogram: log(frequency) by time       1.19 μs <

 Memory estimate: 1.97 KiB, allocs estimate: 4.

If I’m reading the output correctly, on average ~93% of the total runtime of motif is spent in garbage collection?

This is not what I’d expect, as append!, applied ‘naked’, is very fast:

julia> @benchmark append!($[1, 2, 3], $[4])
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):   7.300 ns …  14.219 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      7.800 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.221 ns ± 167.705 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇▅▃▂▁▁  ▂▄▃▂▁                                              ▁
  ████████████████▇▆▆▆▆▇▆▃▄▅▅▃▄▄▄▅▅▃▅▄▄▄▃▅▅▄▅▅▃▃▃▃▄▄▄▃▂▃▄▃▃▂▃▅ █
  7.3 ns        Histogram: log(frequency) by time      19.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

What am I messing up in motif?

All the garbage collection comes from continually needing to reallocate `out` every time it gets bigger.
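You can see the reallocation happening directly. Here’s a small sketch (not from the original code) that records the address of `out`’s buffer after each `append!`; as the vector outgrows its capacity, the backing buffer moves and the old one becomes garbage for the GC to collect:

```julia
out = Float64[]
addrs = UInt[]  # buffer address after each append!
for _ in 1:10
    append!(out, rand(10))
    push!(addrs, UInt(pointer(out)))  # address of the backing buffer
end
# Typically more than one distinct address: the buffer was reallocated
# (growth is geometric, so the number of moves grows only logarithmically).
length(unique(addrs))
```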

In this specific MWE, one could simply call `reduce(vcat, v)` to concatenate all the vectors.
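For instance (a quick illustrative example, not from the original post — `reduce(vcat, ...)` has a specialized method for a vector of vectors that allocates the result in one go):

```julia
v = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
reduce(vcat, v)  # → [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```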

More generally (as I imagine your case is), one can use `sizehint!` to suggest that `out` reserve a certain amount of space up front, so that it is not “surprised” every time it gets a new vector and needs to grow again.

function motif(v::Vector{Vector{Float64}})
    finallength = sum(length, v) # predict how big `out` will get
    out = sizehint!(Float64[], finallength)
    for vect in v
        append!(out, vect)
    end
    out
end

Even faster than calling append! repeatedly would be an explicit loop over a pre-allocated vector, if you need ultimate speed.

function motif(v::Vector{Vector{Float64}})
    out = Float64[]
    sizehint!(out, sum(length, v))
    for vect in v
        append!(out, vect)
    end
    out
end
function motif2(v::Vector{Vector{Float64}})
    out = Vector{Float64}(undef, sum(length, v))
    offset = 0
    for vect in v
        for i ∈ eachindex(vect)
            @inbounds out[offset + i] = vect[i]
        end
        offset += lastindex(vect)
    end
    out
end
vects = [rand(10) for i = 1:10]

julia> @benchmark motif($vects)
BenchmarkTools.Trial: 10000 samples with 937 evaluations.
 Range (min … max):  113.479 ns …  1.205 μs  ┊ GC (min … max): 0.00% … 88.93%
 Time  (median):     126.926 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   139.781 ns ± 94.683 ns  ┊ GC (mean ± σ):  8.96% ± 11.14%

  ▇█▁                                                          ▁
  ███▅▅▁▄▄▅▆▅█▅▆▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▁▄▆▄▅▅▆▇▇ █
  113 ns        Histogram: log(frequency) by time       811 ns <

 Memory estimate: 880 bytes, allocs estimate: 2.

julia> @benchmark motif2($vects)
BenchmarkTools.Trial: 10000 samples with 984 evaluations.
 Range (min … max):  60.711 ns …  1.000 μs  ┊ GC (min … max):  0.00% … 76.75%
 Time  (median):     65.478 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   82.285 ns ± 87.815 ns  ┊ GC (mean ± σ):  12.95% ± 11.09%

  █▄▃▃▂                                                       ▁
  █████▅▄▄▇█▅▄▃▁▆▇▆▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆▆▆▆▅▅▅▅▆ █
  60.7 ns      Histogram: log(frequency) by time       689 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

I hate it when Base loses in these microbenchmarks, like:

using BenchmarkTools

function motif(v::Vector{Vector{Float64}})
    out = Float64[]
    sizehint!(out, sum(length, v))
    for vect in v
        append!(out, vect)
    end
    out
end
function motif2(v::Vector{Vector{Float64}})
    out = Vector{Float64}(undef, sum(length, v))
    offset = 0
    for vect in v
        for i ∈ eachindex(vect)
            @inbounds out[offset + i] = vect[i]
        end
        offset += lastindex(vect)
    end
    out
end
function motif3(v)
    vcat(v...)
end

vects = [rand(10) for i = 1:10]
@btime motif($vects)
@btime motif2($vects)
@btime motif3($vects)

resulting in

  171.154 ns (2 allocations: 880 bytes)
  99.471 ns (1 allocation: 896 bytes)
  153.222 ns (1 allocation: 896 bytes) 

Weirdly, I’m struggling to replicate the performance of motif2:

using LoopVectorization, BenchmarkTools

function motif2(v::Vector{Vector{Float64}})
    out = Vector{Float64}(undef, sum(length, v))
    offset = 0
    for vect in v
        for i ∈ eachindex(vect)
            @inbounds out[offset + i] = vect[i]
        end
        offset += lastindex(vect)
    end
    out
end

vects = [rand(10) for i = 1:10]

@benchmark motif2($vects)

yields

julia> @benchmark motif2($vects)
BenchmarkTools.Trial: 2283 samples with 964 evaluations.
 Range (min … max):   82.676 ns … 293.815 μs  ┊ GC (min … max):  0.00% … 99.92%
 Time  (median):     262.241 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):     2.359 μs ±  24.501 μs  ┊ GC (mean ± σ):  89.92% ±  8.59%

  █▃▃▂ ▁          ▄▇▅▅▄▃▄▃▃▂▁▁▁▁                                ▁
  ███████▆▆▆▇▆▆▅▅▆████████████████▇▇█▇▇▇▆▇▇█▆▆▇▆▁▆▅▄▆▄▆▆▁▄▄▁▄▄▆ █
  82.7 ns       Histogram: log(frequency) by time        706 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

Any ideas what could cause this?

System specs:

julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf26 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 16 on 16 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 16

I get the same problem using mikmoore’s solution:

function motif(v::Vector{Vector{Float64}})
    finallength = sum(length, v) # predict how big `out` will get
    out = sizehint!(Float64[], finallength)
    for vect in v
        append!(out, vect)
    end
    out
end

julia> @benchmark motif($vects)
BenchmarkTools.Trial: 2433 samples with 873 evaluations.
 Range (min … max):  124.170 ns … 394.300 μs  ┊ GC (min … max):  0.00% … 99.94%
 Time  (median):     319.931 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):     2.370 μs ±  25.112 μs  ┊ GC (mean ± σ):  85.49% ±  8.08%

  ▆█▁     ▁ ▄▁
  ███▆▄▄▃▃█▇██▇█▆▆▆▅▆▄▄▄▄▄▄▃▃▃▃▃▃▃▃▃▂▃▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▂▁▂▁▂▂▂▂▁▂ ▃
  124 ns           Histogram: frequency by time         1.24 μs <

 Memory estimate: 880 bytes, allocs estimate: 2.

Looks like your CPU just isn’t as fast.

Why does that increase the % GC?

I also tried to run your code and I got

julia> @benchmark motif($vects)
BenchmarkTools.Trial: 10000 samples with 935 evaluations.
 Range (min … max):  113.711 ns … 804.605 ns  ┊ GC (min … max): 0.00% … 74.26%
 Time  (median):     118.448 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   127.055 ns ±  57.778 ns  ┊ GC (mean ± σ):  6.02% ± 10.36%

  █▃                                                          ▁ ▁
  ██▇▄▄▃▄▁▁▁▁▁▁▁▁▃▁▁▄▇▅▆▅▃▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
  114 ns        Histogram: log(frequency) by time        540 ns <

 Memory estimate: 880 bytes, allocs estimate: 2.

julia> @benchmark motif2($vects)
BenchmarkTools.Trial: 10000 samples with 985 evaluations.
 Range (min … max):  55.046 ns … 592.647 ns  ┊ GC (min … max): 0.00% … 68.80%
 Time  (median):     62.345 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   70.433 ns ±  49.878 ns  ┊ GC (mean ± σ):  8.14% ±  9.94%

  ▅█▁                                                          ▁
  ███▇▆▆▆▆▆▆▅▆▆▆▅▄▁▁▁▁▁▁▄▅▇▅▄▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆▅▆▆▆▆▆▆▅ █
  55 ns         Histogram: log(frequency) by time       414 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> @benchmark motif3($vects)
BenchmarkTools.Trial: 10000 samples with 955 evaluations.
 Range (min … max):   91.277 ns … 602.112 ns  ┊ GC (min … max): 0.00% … 67.17%
 Time  (median):      97.727 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   106.663 ns ±  53.868 ns  ┊ GC (mean ± σ):  5.69% ±  9.16%

  ██▃                                                           ▂
  █████▇▅▅▆▅▆▆▆▅▆▆▁▁▁▁▁▁▃▇▆▁▆▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▆▅▅▆▅▇▆▅▆ █
  91.3 ns       Histogram: log(frequency) by time        480 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.


julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 5700X 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 1 on 16 virtual cores

I should add that restarting my PC and changing to JULIA_NUM_THREADS = 1 has no effect on my results.

GC is “just” code too, and pretty branchy at that. If your CPU is overall slower/worse at predicting those branches, having slower GC should not be surprising.

But wouldn’t I then expect my operations to be uniformly slower, with the % GC staying roughly constant even as the wall time increases?

Shouldn’t I also still see the ~2× speed-up of motif2 over motif?

e.g. I do get the ~2× speed-up if I only look at the minimum execution time, although in absolute terms this is slower than the other examples in this thread:

julia> @btime motif($vects)
  137.628 ns (2 allocations: 880 bytes)
julia> @btime motif2($vects)
  73.347 ns (1 allocation: 896 bytes)

but the mean time is ~2 μs for both.

I don’t think that is what it shows.

julia> function bench(v)
           for i in 1:500000
               motif(v)
           end
       end;

julia> @time bench(vects);
  0.157026 seconds (2.00 M allocations: 961.496 MiB, 8.70% gc time, 2.73% compilation time)

julia> @time bench(vects);
  0.147299 seconds (2.00 M allocations: 961.304 MiB, 8.53% gc time)

julia> @time bench(vects);
  0.149719 seconds (2.00 M allocations: 961.304 MiB, 8.79% gc time)

julia> GC.enable(false) # disable the GC
true

julia> @time bench(vects);
  0.172612 seconds (2.00 M allocations: 961.304 MiB)

julia> @time bench(vects);
  0.175519 seconds (2.00 M allocations: 961.304 MiB)

julia> @time bench(vects);
  0.170377 seconds (2.00 M allocations: 961.304 MiB)

It might show that when a GC runs, it takes 90% of the time of that sample but the GC doesn’t run every time a sample is taken.
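One way to check that directly is to count how many GC pauses actually occur across many calls. This sketch uses Base.gc_num() and Base.GC_Diff, which are unexported internals (the same machinery @timed uses) and may change between Julia versions:

```julia
f() = [rand(10) for _ in 1:10]  # stand-in for any allocating call

before = Base.gc_num()
for _ in 1:100_000
    f()
end
diff = Base.GC_Diff(Base.gc_num(), before)  # argument order: (after, before)
# `diff.pause` is the number of GC pauses — typically far fewer than
# the 100_000 allocating calls, so most benchmark samples see no GC at all.
diff.pause
```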

Okay, but it’s being called frequently enough and long enough to become a bottleneck (at least on my machine):

julia> function bench(v)
           for i in 1:500000
               motif(v)
           end
       end
bench (generic function with 1 method)

julia> @time bench(vects)
  0.950454 seconds (1.00 M allocations: 419.810 MiB, 83.13% gc time, 0.57% compilation time)

julia> @time bench(vects)
  0.907867 seconds (1000.00 k allocations: 419.617 MiB, 82.16% gc time)

julia> @time bench(vects)
  0.911150 seconds (1000.00 k allocations: 419.617 MiB, 82.75% gc time)

julia> GC.enable(false)
true

julia> @time bench(vects)
  0.205293 seconds (1000.00 k allocations: 419.617 MiB)

julia> @time bench(vects)
  0.208905 seconds (1000.00 k allocations: 419.617 MiB)

julia> @time bench(vects)
  0.193949 seconds (1000.00 k allocations: 419.617 MiB)

Any idea what’s going on here, or how to bypass it?

The use-case for this is to execute motif thousands of times within a given simulation, and to run those simulations on individual CPUs on an HPC cluster. Those CPUs are slower than the one in my machine, so if this GC overhead is due to CPU speed it could be a problem.

It is surprising to me that the GC takes 8% on my machine and 82% on your machine… Do you have a standard Julia install? Anything else that could be specific on your system?

The way to bypass it is to not allocate so much garbage. You can for example pass in a “scratch array” to motif to reuse the memory of that array over and over:

function motif!(out::Vector{Float64}, v::Vector{Vector{Float64}})
    resize!(out, 0)
    for vect in v
        append!(out, vect)
    end
    out
end

vects = [rand(10) for i = 1:10]
v = Float64[]
@benchmark motif!($v, $vects)

I was using v1.8.2, but there was an old v1.6.3 installation floating around too. I uninstalled both, removed everything in c:/users/user/.julia using rm -rf in Git Bash, then re-installed v1.8.2 after a restart.

That seems to have fixed it:

julia> function motif(v::Vector{Vector{Float64}})
           out = Float64[]
           sizehint!(out, sum(length, v))
           for vect in v
               append!(out, vect)
           end
           out
       end
motif (generic function with 1 method)

julia> function motif2(v::Vector{Vector{Float64}})
           out = Vector{Float64}(undef, sum(length, v))
           offset = 0
           for vect in v
               for i ∈ eachindex(vect)
                   @inbounds out[offset + i] = vect[i]
               end
               offset += lastindex(vect)
           end
           out
       end
motif2 (generic function with 1 method)

julia> vects = [rand(10) for i = 1:10];

julia> @benchmark motif($vects)
BenchmarkTools.Trial: 10000 samples with 919 evaluations.
 Range (min … max):  114.799 ns …  1.281 μs  ┊ GC (min … max): 0.00% … 87.56%
 Time  (median):     124.483 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   139.924 ns ± 72.510 ns  ┊ GC (mean ± σ):  6.37% ± 10.41%

  ██▆▄▂▁▁▁▁                                                    ▂
  ███████████▇▆▅▆▅▅▅▅█▇▇▄▅▅▄▅▄▄▅▅▃▄▃▃▄▁▁▃▁▅▁▁▃▄▁▁▁▃▁▁▃▁▃▁▁▅▆██ █
  115 ns        Histogram: log(frequency) by time       614 ns <

 Memory estimate: 880 bytes, allocs estimate: 2.

julia> @benchmark motif2($vects)
BenchmarkTools.Trial: 10000 samples with 984 evaluations.
 Range (min … max):  61.585 ns …  1.003 μs  ┊ GC (min … max): 0.00% … 29.76%
 Time  (median):     72.459 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   86.155 ns ± 62.217 ns  ┊ GC (mean ± σ):  6.78% ±  9.24%

  ▆█▆▄▃▂▁ ▁                                                   ▂
  █████████▇▇▆▆▅▄▅▅▆█▆▆▅▅▅▆▇▆▇▅▅▄▄▅▄▄▄▁▄▅▄▄▆▆▆▆▅▄▄▄▄▅▄▃▃▄▃▁▃▆ █
  61.6 ns      Histogram: log(frequency) by time       493 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

Thanks all for the help!

It’s not really clear how that could have solved it… unless you somehow replaced your default sysimage.


No idea about the specifics; I just used the standard (Windows 10) Apps & Features > Julia 1.8.2 and clicked ‘Uninstall’.