I'm working with lots of vectors of variable length and contents, so I use the push! and append! methods frequently in my code. Using @benchmark from BenchmarkTools, I've identified the garbage collection associated with these methods as a significant bottleneck: @benchmark shows that about 80% of my total runtime is spent on garbage collection on average (mean % GC), and profiling with @profview identifies these methods as the main culprits.
As an MWE, here is a motif I use a lot in my code: combining many smaller vectors into one larger vector with append!:
function motif(v::Vector{Vector{Float64}})
    out = Float64[]
    for vect in v
        append!(out, vect)
    end
    out
end
Benchmarking then gives:
using BenchmarkTools, Random
vects = [rand(10) for i = 1:10]
@benchmark motif($vects)
All the garbage collection comes from repeatedly reallocating out each time it needs to grow.
In this specific MWE, one could simply call reduce(vcat, v) to concatenate all the vectors.
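As a quick sanity check (a sketch, reusing the `vects` setup from the benchmark snippet), the one-liner produces the same flat vector as the loop-based motif:

```julia
# reduce(vcat, v) concatenates all inner vectors into one Vector{Float64}.
vects = [rand(10) for i = 1:10]
flat = reduce(vcat, vects)
@assert length(flat) == 100          # 10 vectors × 10 elements
@assert flat[1:10] == vects[1]       # first block matches first input vector
```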
More generally (as I imagine your case is), one can use sizehint! to suggest that out reserve a certain amount of space up front, so that it is not “surprised” every time it gets a new vector and needs to grow again.
function motif(v::Vector{Vector{Float64}})
    finallength = sum(length, v)  # predict how big `out` will get
    out = sizehint!(Float64[], finallength)
    for vect in v
        append!(out, vect)
    end
    out
end
Even faster here than calling append! repeatedly would be an explicit loop over a pre-allocated vector, if you need ultimate speed.
function motif(v::Vector{Vector{Float64}})
    out = Float64[]
    sizehint!(out, sum(length, v))
    for vect in v
        append!(out, vect)
    end
    out
end

function motif2(v::Vector{Vector{Float64}})
    out = Vector{Float64}(undef, sum(length, v))
    offset = 0
    for vect in v
        for i ∈ eachindex(vect)
            @inbounds out[offset + i] = vect[i]
        end
        offset += lastindex(vect)
    end
    out
end
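As a further variation (not from the thread, and `motif2_copyto` is a name I'm making up here): the inner element loop in motif2 can be replaced with a single copyto! call per inner vector, which dispatches to an optimized block copy for Vector{Float64}:

```julia
# Variation on motif2: copy each inner vector with one copyto! call
# instead of an element-by-element loop.
function motif2_copyto(v::Vector{Vector{Float64}})
    out = Vector{Float64}(undef, sum(length, v))
    offset = 1                                   # 1-based destination index
    for vect in v
        copyto!(out, offset, vect, 1, length(vect))
        offset += length(vect)
    end
    out
end
```

It returns the same vector as motif2; whether it is faster will depend on the lengths of the inner vectors.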
I hate it when Base loses micro-benchmarks like these:
using BenchmarkTools

function motif(v::Vector{Vector{Float64}})
    out = Float64[]
    sizehint!(out, sum(length, v))
    for vect in v
        append!(out, vect)
    end
    out
end

function motif2(v::Vector{Vector{Float64}})
    out = Vector{Float64}(undef, sum(length, v))
    offset = 0
    for vect in v
        for i ∈ eachindex(vect)
            @inbounds out[offset + i] = vect[i]
        end
        offset += lastindex(vect)
    end
    out
end

function motif3(v)
    vcat(v...)
end
vects = [rand(10) for i = 1:10]
@btime motif($vects)
@btime motif2($vects)
@btime motif3($vects)
Weirdly, I'm struggling to replicate the performance of motif2:
using LoopVectorization, BenchmarkTools

function motif2(v::Vector{Vector{Float64}})
    out = Vector{Float64}(undef, sum(length, v))
    offset = 0
    for vect in v
        for i ∈ eachindex(vect)
            @inbounds out[offset + i] = vect[i]
        end
        offset += lastindex(vect)
    end
    out
end

vects = [rand(10) for i = 1:10]
@benchmark motif2($vects)

function motif(v::Vector{Vector{Float64}})
    finallength = sum(length, v)  # predict how big `out` will get
    out = sizehint!(Float64[], finallength)
    for vect in v
        append!(out, vect)
    end
    out
end
@benchmark motif($vects)
GC is “just” code too, and pretty branchy at that. If your CPU is overall slower/worse at predicting those branches, having slower GC should not be surprising.
e.g. I get the ~2x speed-up if I'm only interested in the minimum execution time, although in absolute terms this is slower than other examples in this thread.
Okay, but it’s being called frequently enough and long enough to become a bottleneck (at least on my machine):
julia> function bench(v)
           for i in 1:500000
               motif(v)
           end
       end
bench (generic function with 1 method)
julia> @time bench(vects)
0.950454 seconds (1.00 M allocations: 419.810 MiB, 83.13% gc time, 0.57% compilation time)
julia> @time bench(vects)
0.907867 seconds (1000.00 k allocations: 419.617 MiB, 82.16% gc time)
julia> @time bench(vects)
0.911150 seconds (1000.00 k allocations: 419.617 MiB, 82.75% gc time)
julia> GC.enable(false)
true
julia> @time bench(vects)
0.205293 seconds (1000.00 k allocations: 419.617 MiB)
julia> @time bench(vects)
0.208905 seconds (1000.00 k allocations: 419.617 MiB)
julia> @time bench(vects)
0.193949 seconds (1000.00 k allocations: 419.617 MiB)
Any idea what’s going on here, or how to bypass it?
The use-case for this is to execute motif thousands of times within a given simulation, and to run those simulations on individual CPUs on an HPC cluster. Those CPUs are slower than the one in my machine, so if this GC overhead is tied to CPU speed, it could be a problem.
It is surprising to me that the GC takes 8% on my machine and 82% on yours… Do you have a standard Julia install? Is there anything else that could be specific to your system?
The way to bypass it is to not allocate so much garbage. You can for example pass in a “scratch array” to motif to reuse the memory of that array over and over:
function motif!(out::Vector{Float64}, v::Vector{Vector{Float64}})
    resize!(out, 0)
    for vect in v
        append!(out, vect)
    end
    out
end
vects = [rand(10) for i = 1:10]
v = Float64[]
@benchmark motif!($v, $vects)
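As a sketch of how this plays out inside a simulation loop (the step count and names here are illustrative, not from the thread): the scratch buffer grows once on the first call and is then recycled, so steady-state iterations allocate nothing for the output vector.

```julia
# motif! as defined above: resize!(out, 0) drops the contents but keeps
# the allocated capacity, so subsequent append! calls reuse the memory.
function motif!(out::Vector{Float64}, v::Vector{Vector{Float64}})
    resize!(out, 0)
    for vect in v
        append!(out, vect)
    end
    out
end

vects = [rand(10) for i = 1:10]
scratch = Float64[]
for step in 1:1_000            # hypothetical simulation steps
    motif!(scratch, vects)     # fills `scratch` in place, no fresh Vector
    # ... use `scratch` for this step ...
end
```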
I was using v1.8.2, but there was an old v1.6.3 installation floating around too. I uninstalled both, removed everything in c:/users/user/.julia with rm -rf in Git Bash, and re-installed v1.8.2 after a restart.
That seems to have fixed it:
julia> function motif(v::Vector{Vector{Float64}})
           out = Float64[]
           sizehint!(out, sum(length, v))
           for vect in v
               append!(out, vect)
           end
           out
       end
motif (generic function with 1 method)

julia> function motif2(v::Vector{Vector{Float64}})
           out = Vector{Float64}(undef, sum(length, v))
           offset = 0
           for vect in v
               for i ∈ eachindex(vect)
                   @inbounds out[offset + i] = vect[i]
               end
               offset += lastindex(vect)
           end
           out
       end
motif2 (generic function with 1 method)
julia> vects = [rand(10) for i = 1:10];
julia> @benchmark motif($vects)
BenchmarkTools.Trial: 10000 samples with 919 evaluations.
Range (min … max): 114.799 ns … 1.281 μs ┊ GC (min … max): 0.00% … 87.56%
Time (median): 124.483 ns ┊ GC (median): 0.00%
Time (mean ± σ): 139.924 ns ± 72.510 ns ┊ GC (mean ± σ): 6.37% ± 10.41%
██▆▄▂▁▁▁▁ ▂
███████████▇▆▅▆▅▅▅▅█▇▇▄▅▅▄▅▄▄▅▅▃▄▃▃▄▁▁▃▁▅▁▁▃▄▁▁▁▃▁▁▃▁▃▁▁▅▆██ █
115 ns Histogram: log(frequency) by time 614 ns <
Memory estimate: 880 bytes, allocs estimate: 2.
julia> @benchmark motif2($vects)
BenchmarkTools.Trial: 10000 samples with 984 evaluations.
Range (min … max): 61.585 ns … 1.003 μs ┊ GC (min … max): 0.00% … 29.76%
Time (median): 72.459 ns ┊ GC (median): 0.00%
Time (mean ± σ): 86.155 ns ± 62.217 ns ┊ GC (mean ± σ): 6.78% ± 9.24%
▆█▆▄▃▂▁ ▁ ▂
█████████▇▇▆▆▅▄▅▅▆█▆▆▅▅▅▆▇▆▇▅▅▄▄▅▄▄▄▁▄▅▄▄▆▆▆▆▅▄▄▄▄▅▄▃▃▄▃▁▃▆ █
61.6 ns Histogram: log(frequency) by time 493 ns <
Memory estimate: 896 bytes, allocs estimate: 1.