How best to preallocate memory for repeated use

Hello,

I’m trying to reduce memory allocations. The pattern is simple: I need to solve a similar problem, one that requires a fair bit of memory, many times over. So I’m preallocating sufficient memory for the largest problem and then reshaping that buffer to each problem’s size.

A minimal working example is below. Raw is the computation with a fresh allocation on every iteration; the others are attempts at making this run more efficiently. All of them improve things by about a factor of ten.

My question is: which of these methods is preferable (+why) and how can I improve on what’s here? Single-threaded benchmarks are at the bottom.

using Bumper, ManualMemory, BenchmarkTools

# Haven't tried ManualMemory yet, because there are no docs yet


function Compute💩( X )
    X .= 3.4
    return sum( X )
end




function Raw( R, C, N )
    Σ = 0.0
    for i ∈ 1:N
        X = zeros( R + i, C + i )
        Σ += Compute💩( X ) 
    end
    return Σ
end


function Reinterpret( R, C, N )
    M = zeros( (R+N ) * ( C + N ) )
    Σ = 0.0
    for i ∈ 1:N
        X = @views reshape( M[1:(R+i)*(C+i)], R+i, C+i )
        Σ += Compute💩( X ) 
    end
    return Σ
end



function Bump( R, C, N )
    Σ = @no_escape begin
        M = @alloc( Float64,  (R+N ) * ( C + N )  )
        Σ = 0.0
        for i ∈ 1:N
            X = @views reshape( M[1:(R+i)*(C+i)], R+i, C+i )
            Σ += Compute💩( X ) 
        end
        Σ
    end
    return Σ
end



function Wrap( R, C, N )
    buf = zeros( (R+N) * (C+N) )   # keep the array itself alive; preserving only the pointer would not root it
    M = pointer( buf )
    Σ = 0.0
    GC.@preserve buf begin
        for i ∈ 1:N
            X = Base.unsafe_wrap( Matrix{Float64}, M, (R+i, C+i) )
            Σ += Compute💩( X )
        end
    end
    return Σ
end

julia> @benchmark Raw( 1000, 1000, 100 )
BenchmarkTools.Trial: 18 samples with 1 evaluation per sample.
 Range (min … max):  238.456 ms … 321.295 ms  ┊ GC (min … max): 12.18% … 18.40%
 Time  (median):     289.997 ms               ┊ GC (median):    17.93%
 Time  (mean ± σ):   287.377 ms ±  22.403 ms  ┊ GC (mean ± σ):  17.74% ±  1.46%

                                       █                      ▃  
  ▇▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▇▇▁▁▁▁▁▁▁▇▇▁█▁▇▁▁▇▇▇▁▁▁▁▇▁▇▁▁▁▁▁▁▁▁█ ▁
  238 ms           Histogram: frequency by time          321 ms <

 Memory estimate: 842.59 MiB, allocs estimate: 300.

julia> @benchmark Reinterpret( 1000, 1000, 100 )
BenchmarkTools.Trial: 185 samples with 1 evaluation per sample.
 Range (min … max):  22.954 ms … 31.429 ms  ┊ GC (min … max): 0.00% … 6.65%
 Time  (median):     27.149 ms              ┊ GC (median):    1.38%
 Time  (mean ± σ):   27.077 ms ±  1.685 ms  ┊ GC (mean ± σ):  2.97% ± 3.52%

                         ▄▂▂ ▁▂▃▃▂      ▃█                     
  ▄▃▄▁▁█▆▄▃▅▅▄▁▁▁▁▃▁▃▁▁▁▁██████████▄▄▆▆▇██▆▆▅▃▅▄▄▁▃▃▃▃▃▁▁▃▁▁▃ ▃
  23 ms           Histogram: frequency by time        31.1 ms <

 Memory estimate: 9.23 MiB, allocs estimate: 3.

julia> @benchmark Bump( 1000, 1000, 100 )
BenchmarkTools.Trial: 193 samples with 1 evaluation per sample.
 Range (min … max):  25.215 ms …  31.155 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.598 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.938 ms ± 746.378 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄█▄▃▂                                                       
  ▃▃█████▆▅▅▃▃▃▃▁▃▃▄▅▃▄▄▄▃▅▆▆▄▄▄▃▃▃▃▁▁▁▃▁▁▃▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃ ▃
  25.2 ms         Histogram: frequency by time         28.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Wrap( 1000, 1000, 100 )
BenchmarkTools.Trial: 190 samples with 1 evaluation per sample.
 Range (min … max):  21.764 ms … 35.432 ms  ┊ GC (min … max): 0.00% … 0.95%
 Time  (median):     26.827 ms              ┊ GC (median):    1.46%
 Time  (mean ± σ):   26.329 ms ±  2.484 ms  ┊ GC (mean ± σ):  3.08% ± 3.80%

               ▁█           ▃▆▄▃                               
  ▄▆▄▃▄▃▆▄▄▁▃▅███▄▄▆▄▄▄▄▄▃▄▄█████▄▄▁▄▃▄▁█▄▄▄▆▃▄▁▃▄▃▁▁▃▁▁▁▁▁▁▃ ▃
  21.8 ms         Histogram: frequency by time        33.1 ms <

 Memory estimate: 9.24 MiB, allocs estimate: 302.

I can’t comment on the usage pattern, as I’m not familiar with Bumper, but I’d suggest looking at the code with tests and `@code_lowered` & friends to see what it actually does.

From my C++ days with custom arena allocators: it’s possible (in edge cases) to get misaligned memory, which can be undefined behavior or a performance hit. I’m not saying that’s the case here, just something to look out for. If you plan to allocate structs in this memory, alignment can be an issue (it is in C++; I don’t know the exact rules for Julia + custom structs + arenas).

Your benchmarks are interesting in that, in principle, the compiler should be able to figure out that your sum can be precomputed as `prod(size(X)) * constant`, without any array access, so your code may not be benchmarking the memory access pattern at all.

It would probably still be forced to do the assignment as a side effect, so the benchmark would be testing the assignment and the allocation, but not necessarily the indexed compute and strided access.
`@code_llvm` and `@code_lowered` would show whether that is happening.
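A quick way to check (a sketch; the helper `f` just reproduces the fill-then-sum pattern from the original post):

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Inspect the generated IR to see whether the sum was folded into a
# multiply. If it was, the IR will contain no reduction loop over the
# array, only the fill and a constant-derived result.
f(X) = (X .= 3.4; sum(X))

ir = sprint(io -> code_llvm(io, f, Tuple{Matrix{Float64}}))
println(ir)   # scan for a vector-reduce loop vs. a bare fmul
```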

Thanks for sharing this question, I didn’t know Bumper was available, and seems like a well developed interface to arena allocators.


Reinterpret, Bump, and Wrap have virtually identical performance, so I recommend the native Julia implementation in Reinterpret:

  • Zero dependencies, which greatly reduces both the learning curve and ongoing maintenance.
  • Equivalent overhead. Bumper.jl must perform a buffer lookup on each call, whereas Reinterpret only incurs a single GC allocation for the array M. These overheads are essentially on par.
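For repeated use, the Reinterpret pattern can also be wrapped in a small helper. A sketch (`Scratch` and `matview` are made-up names, not from any package):

```julia
# One preallocated buffer, many shapes: the view + reshape trick from
# Reinterpret, packaged for reuse.
struct Scratch
    buf::Vector{Float64}
end
Scratch(n::Integer) = Scratch(Vector{Float64}(undef, n))

# Return an r×c matrix view into the buffer (no new heap allocation).
function matview(s::Scratch, r::Integer, c::Integer)
    r * c <= length(s.buf) || throw(ArgumentError("scratch buffer too small"))
    return reshape(view(s.buf, 1:r*c), r, c)
end

s = Scratch(10_000)
X = matview(s, 50, 40)
X .= 3.4
sum(X)   # ≈ 6800.0
```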

Thank you @karei and @benc.

Bumper only works for isbits types, so that presumably rules out structs. And for the program I posted, there indeed is not much point using Bumper.

It seems to me that, absent preallocation, the benefits of Bumper arise in examples like the one below, i.e. ones in which the garbage collector has to do a lot of work (if N is large).

using Bumper, BenchmarkTools

const a = rand(0:100, 1000000 )

function ComputeRaw💩( R, C )
    X = Matrix{Float64}(undef, R, C )
    for i ∈ eachindex( X )
        X[i] = rand()
    end
    return sum( X )
end

function ComputeBump💩( R, C )
    Σ = @no_escape begin
        X = @alloc( Float64, R* C )
        for i ∈ eachindex( X )
            X[i] = rand()
        end
        sum( X )
    end
    return Σ 
end


function Raw( R, C, N, a )
    Σ = 0.0
    for i ∈ 1:N
        Σ += ComputeRaw💩( R+a[i], C+a[i] ) 
    end
    return Σ
end


function Bump( R, C, N, a )
    Σ = 0.0
    for i ∈ 1:N
        Σ += ComputeBump💩( R+a[i], C+a[i] ) 
    end
    return Σ
end
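The GC effect can be measured directly with `@timed` (a self-contained sketch restating the two compute kernels above; `raw_sum` and `bump_sum` are my names, and the exact numbers are machine-dependent, but the raw version should report nonzero `gctime` while the Bumper version stays near zero):

```julia
using Bumper

function raw_sum(R, C, N)
    Σ = 0.0
    for _ in 1:N
        X = Matrix{Float64}(undef, R, C)   # fresh heap allocation each time
        X .= rand()
        Σ += sum(X)
    end
    return Σ
end

function bump_sum(R, C, N)
    Σ = 0.0
    for _ in 1:N
        Σ += @no_escape begin
            X = @alloc(Float64, R * C)     # arena allocation, freed at end of block
            X .= rand()
            sum(X)
        end
    end
    return Σ
end

raw_sum(10, 10, 10); bump_sum(10, 10, 10)  # warm up / compile
t_raw  = @timed raw_sum(100, 100, 10_000)
t_bump = @timed bump_sum(100, 100, 10_000)
(t_raw.gctime, t_bump.gctime)              # raw pays GC time; bump ≈ 0
```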

A lot of structs are isbits; just check with `isbitstype`.
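For example (a quick sketch; the struct names are made up):

```julia
# isbits: immutable, with all fields themselves isbits.
struct Point
    x::Float64
    y::Float64
end

# A String (or any heap-allocated) field breaks the property.
struct Labeled
    name::String
    v::Float64
end

isbitstype(Point)            # true  — fine for Bumper's @alloc
isbitstype(Labeled)          # false — heap-referencing field
isbitstype(Vector{Float64})  # false — arrays are heap objects
```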


Learning something new every day.