Hello,
I’m trying to reduce memory allocations. The pattern is simple: I need to redo a similar problem that requires a fair bit of memory many times. So I’m preallocating sufficient memory for the largest problem and then reshaping that.
A minimal working example is below. Raw
is the computation while reallocating, the others are attempts at making this run more efficiently. All improve things by about a factor of ten.
My question is: which of these methods is preferable (+why) and how can I improve on what’s here? Single-threaded benchmarks are at the bottom.
using Bumper, ManualMemory, BenchmarkTools
# Haven't tried ManualMemory yet, because there's no docs yet
function Compute💩( X )
X .= 3.4
return sum( X )
end
function Raw( R, C, N )
Σ = 0.0
for i ∈ 1:N
X = zeros( R + i, C + i )
Σ += Compute💩( X )
end
return Σ
end
function Reinterpret( R, C, N )
M = zeros( (R+N ) * ( C + N ) )
Σ = 0.0
for i ∈ 1:N
@views X = reshape( M[1:(R+i)*(C+i)], R+i, C+i )
Σ += Compute💩( X )
end
return Σ
end
function Bump( R, C, N )
Σ = @no_escape begin
M = @alloc( Float64, (R+N ) * ( C + N ) )
Σ = 0.0
for i ∈ 1:N
@views X = reshape( M[1:(R+i)*(C+i)], R+i, C+i )
Σ += Compute💩( X )
end
Σ
end
return Σ
end
function Wrap( R, C, N )
M = pointer( zeros( (R+N) * (C+N) ) )
Σ = 0.0
GC.@preserve M begin
for i ∈ 1:N
X = Base.unsafe_wrap(Matrix{Float64}, M, (R+i, C+i))
Σ += Compute💩( X )
end
end
return Σ
end
julia> @benchmark Raw( 1000, 1000, 100 )
BenchmarkTools.Trial: 18 samples with 1 evaluation per sample.
Range (min … max): 238.456 ms … 321.295 ms ┊ GC (min … max): 12.18% … 18.40%
Time (median): 289.997 ms ┊ GC (median): 17.93%
Time (mean ± σ): 287.377 ms ± 22.403 ms ┊ GC (mean ± σ): 17.74% ± 1.46%
█ ▃
▇▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▇▇▁▁▁▁▁▁▁▇▇▁█▁▇▁▁▇▇▇▁▁▁▁▇▁▇▁▁▁▁▁▁▁▁█ ▁
238 ms Histogram: frequency by time 321 ms <
Memory estimate: 842.59 MiB, allocs estimate: 300.
julia> @benchmark Reinterpret( 1000, 1000, 100 )
BenchmarkTools.Trial: 185 samples with 1 evaluation per sample.
Range (min … max): 22.954 ms … 31.429 ms ┊ GC (min … max): 0.00% … 6.65%
Time (median): 27.149 ms ┊ GC (median): 1.38%
Time (mean ± σ): 27.077 ms ± 1.685 ms ┊ GC (mean ± σ): 2.97% ± 3.52%
▄▂▂ ▁▂▃▃▂ ▃█
▄▃▄▁▁█▆▄▃▅▅▄▁▁▁▁▃▁▃▁▁▁▁██████████▄▄▆▆▇██▆▆▅▃▅▄▄▁▃▃▃▃▃▁▁▃▁▁▃ ▃
23 ms Histogram: frequency by time 31.1 ms <
Memory estimate: 9.23 MiB, allocs estimate: 3.
julia> @benchmark Bump( 1000, 1000, 100 )
BenchmarkTools.Trial: 193 samples with 1 evaluation per sample.
Range (min … max): 25.215 ms … 31.155 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 25.598 ms ┊ GC (median): 0.00%
Time (mean ± σ): 25.938 ms ± 746.378 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▄▃▂
▃▃█████▆▅▅▃▃▃▃▁▃▃▄▅▃▄▄▄▃▅▆▆▄▄▄▃▃▃▃▁▁▁▃▁▁▃▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃ ▃
25.2 ms Histogram: frequency by time 28.4 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark Wrap( 1000, 1000, 100 )
BenchmarkTools.Trial: 190 samples with 1 evaluation per sample.
Range (min … max): 21.764 ms … 35.432 ms ┊ GC (min … max): 0.00% … 0.95%
Time (median): 26.827 ms ┊ GC (median): 1.46%
Time (mean ± σ): 26.329 ms ± 2.484 ms ┊ GC (mean ± σ): 3.08% ± 3.80%
▁█ ▃▆▄▃
▄▆▄▃▄▃▆▄▄▁▃▅███▄▄▆▄▄▄▄▄▃▄▄█████▄▄▁▄▃▄▁█▄▄▄▆▃▄▁▃▄▃▁▁▃▁▁▁▁▁▁▃ ▃
21.8 ms Histogram: frequency by time 33.1 ms <
Memory estimate: 9.24 MiB, allocs estimate: 302.