How best to preallocate memory for repeated use

Hello,

I’m trying to reduce memory allocations. The pattern is simple: I need to solve a similar problem, one that requires a fair bit of memory, many times over. So I’m preallocating sufficient memory for the largest problem and then reshaping that buffer to each problem’s size.

A minimal working example is below. Raw is the computation with a fresh allocation on every iteration; the others are attempts at making this run more efficiently. All of them improve things by about a factor of ten.

My question is: which of these methods is preferable (+why) and how can I improve on what’s here? Single-threaded benchmarks are at the bottom.

using Bumper, ManualMemory, BenchmarkTools

# Haven't tried ManualMemory yet, because there are no docs yet


function Compute💩( X )
    X .= 3.4
    return sum( X )
end




function Raw( R, C, N )
    Σ = 0.0
    for i ∈ 1:N
        X = zeros( R + i, C + i )
        Σ += Compute💩( X ) 
    end
    return Σ
end


function Reinterpret( R, C, N )
    M = zeros( (R+N ) * ( C + N ) )
    Σ = 0.0
    for i ∈ 1:N
        X = @views reshape( M[1:(R+i)*(C+i)], R+i, C+i )
        Σ += Compute💩( X ) 
    end
    return Σ
end



function Bump( R, C, N )
    Σ = @no_escape begin
        M = @alloc( Float64,  (R+N ) * ( C + N )  )
        Σ = 0.0
        for i ∈ 1:N
            X = @views reshape( M[1:(R+i)*(C+i)], R+i, C+i )
            Σ += Compute💩( X ) 
        end
        Σ
    end
    return Σ
end



function Wrap( R, C, N )
    buf = zeros( (R+N) * (C+N) )   # keep the array itself alive; preserving only the pointer would not root it
    M = pointer( buf )
    Σ = 0.0
    GC.@preserve buf begin
        for i ∈ 1:N
            X = Base.unsafe_wrap( Matrix{Float64}, M, (R+i, C+i) )
            Σ += Compute💩( X )
        end
    end
    return Σ
end

julia> @benchmark Raw( 1000, 1000, 100 )
BenchmarkTools.Trial: 18 samples with 1 evaluation per sample.
 Range (min … max):  238.456 ms … 321.295 ms  ┊ GC (min … max): 12.18% … 18.40%
 Time  (median):     289.997 ms               ┊ GC (median):    17.93%
 Time  (mean ± σ):   287.377 ms ±  22.403 ms  ┊ GC (mean ± σ):  17.74% ±  1.46%

                                       █                      ▃  
  ▇▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▇▇▁▁▁▁▁▁▁▇▇▁█▁▇▁▁▇▇▇▁▁▁▁▇▁▇▁▁▁▁▁▁▁▁█ ▁
  238 ms           Histogram: frequency by time          321 ms <

 Memory estimate: 842.59 MiB, allocs estimate: 300.

julia> @benchmark Reinterpret( 1000, 1000, 100 )
BenchmarkTools.Trial: 185 samples with 1 evaluation per sample.
 Range (min … max):  22.954 ms … 31.429 ms  ┊ GC (min … max): 0.00% … 6.65%
 Time  (median):     27.149 ms              ┊ GC (median):    1.38%
 Time  (mean ± σ):   27.077 ms ±  1.685 ms  ┊ GC (mean ± σ):  2.97% ± 3.52%

                         ▄▂▂ ▁▂▃▃▂      ▃█                     
  ▄▃▄▁▁█▆▄▃▅▅▄▁▁▁▁▃▁▃▁▁▁▁██████████▄▄▆▆▇██▆▆▅▃▅▄▄▁▃▃▃▃▃▁▁▃▁▁▃ ▃
  23 ms           Histogram: frequency by time        31.1 ms <

 Memory estimate: 9.23 MiB, allocs estimate: 3.

julia> @benchmark Bump( 1000, 1000, 100 )
BenchmarkTools.Trial: 193 samples with 1 evaluation per sample.
 Range (min … max):  25.215 ms …  31.155 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.598 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.938 ms ± 746.378 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄█▄▃▂                                                       
  ▃▃█████▆▅▅▃▃▃▃▁▃▃▄▅▃▄▄▄▃▅▆▆▄▄▄▃▃▃▃▁▁▁▃▁▁▃▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃ ▃
  25.2 ms         Histogram: frequency by time         28.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Wrap( 1000, 1000, 100 )
BenchmarkTools.Trial: 190 samples with 1 evaluation per sample.
 Range (min … max):  21.764 ms … 35.432 ms  ┊ GC (min … max): 0.00% … 0.95%
 Time  (median):     26.827 ms              ┊ GC (median):    1.46%
 Time  (mean ± σ):   26.329 ms ±  2.484 ms  ┊ GC (mean ± σ):  3.08% ± 3.80%

               ▁█           ▃▆▄▃                               
  ▄▆▄▃▄▃▆▄▄▁▃▅███▄▄▆▄▄▄▄▄▃▄▄█████▄▄▁▄▃▄▁█▄▄▄▆▃▄▁▃▄▃▁▁▃▁▁▁▁▁▁▃ ▃
  21.8 ms         Histogram: frequency by time        33.1 ms <

 Memory estimate: 9.24 MiB, allocs estimate: 302.

I can’t comment on the usage pattern, as I’m not familiar with Bumper, but I’d suggest looking at the code with tests and `@code_lowered` & friends to see what it actually does.

From my C++ days with custom arena allocators: it’s possible (in edge cases) to get misaligned memory, which can be undefined behavior or a performance hit. I’m not saying that’s the case here, just something to look out for. If you plan to allocate structs in this memory, alignment can be an issue (it is in C++; I don’t know the exact rules for Julia + custom structs + arenas).

Your benchmarks are interesting in that, in principle, the compiler should be able to figure out that your sum can be precomputed as `prod(size(X)) * constant`, without any array access, so your code may not be benchmarking the memory access pattern at all.

It would probably still be forced to do the assignment as a side effect, so the benchmark would be testing the assignment and the allocation, but not necessarily the indexed compute and strided access.
`@code_llvm` and `@code_lowered` would show whether that is happening.
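A quick way to check (a sketch; the helper `f` just reproduces the fill-then-sum pattern from the original post):

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Inspect the generated IR to see whether the sum was folded into a
# multiply. If it was, the IR will contain no reduction loop over the
# array, only the fill and a constant-derived result.
f(X) = (X .= 3.4; sum(X))

ir = sprint(io -> code_llvm(io, f, Tuple{Matrix{Float64}}))
println(ir)   # scan for a vector-reduce loop vs. a bare fmul
```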

Thanks for sharing this question, I didn’t know Bumper was available, and seems like a well developed interface to arena allocators.


Reinterpret, Bump, and Wrap have virtually identical performance, so I recommend the native Julia implementation in Reinterpret:

  • Zero dependencies, which greatly reduces both the learning curve and ongoing maintenance.
  • Equivalent overhead. Bumper.jl must perform a buffer lookup on each call, whereas Reinterpret only incurs a single GC allocation for the array M. These overheads are essentially on par.
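For repeated use, the Reinterpret pattern can also be wrapped in a small helper. A sketch (`Scratch` and `matview` are made-up names, not from any package):

```julia
# One preallocated buffer, many shapes: the view + reshape trick from
# Reinterpret, packaged for reuse.
struct Scratch
    buf::Vector{Float64}
end
Scratch(n::Integer) = Scratch(Vector{Float64}(undef, n))

# Return an r×c matrix view into the buffer (no new heap allocation).
function matview(s::Scratch, r::Integer, c::Integer)
    r * c <= length(s.buf) || throw(ArgumentError("scratch buffer too small"))
    return reshape(view(s.buf, 1:r*c), r, c)
end

s = Scratch(10_000)
X = matview(s, 50, 40)
X .= 3.4
sum(X)   # ≈ 6800.0
```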

Thank you @karei and @benc.

Bumper only works for isbits types, so that presumably rules out structs. And for the program I posted, there indeed is not much point using Bumper.

It seems to me that, absent preallocation, the benefits of Bumper arise in examples like the one below, i.e. ones in which the garbage collector has to do a lot of work (if N is large).

using Bumper, BenchmarkTools

const a = rand(0:100, 1000000 )

function ComputeRaw💩( R, C )
    X = Matrix{Float64}(undef, R, C )
    for i ∈ eachindex( X )
        X[i] = rand()
    end
    return sum( X )
end

function ComputeBump💩( R, C )
    Σ = @no_escape begin
        X = @alloc( Float64, R* C )
        for i ∈ eachindex( X )
            X[i] = rand()
        end
        sum( X )
    end
    return Σ 
end


function Raw( R, C, N, a )
    Σ = 0.0
    for i ∈ 1:N
        Σ += ComputeRaw💩( R+a[i], C+a[i] ) 
    end
    return Σ
end


function Bump( R, C, N, a )
    Σ = 0.0
    for i ∈ 1:N
        Σ += ComputeBump💩( R+a[i], C+a[i] ) 
    end
    return Σ
end
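The GC effect can be measured directly with `@timed` (a self-contained sketch restating the two compute kernels above; `raw_sum` and `bump_sum` are my names, and the exact numbers are machine-dependent, but the raw version should report nonzero `gctime` while the Bumper version stays near zero):

```julia
using Bumper

function raw_sum(R, C, N)
    Σ = 0.0
    for _ in 1:N
        X = Matrix{Float64}(undef, R, C)   # fresh heap allocation each time
        X .= rand()
        Σ += sum(X)
    end
    return Σ
end

function bump_sum(R, C, N)
    Σ = 0.0
    for _ in 1:N
        Σ += @no_escape begin
            X = @alloc(Float64, R * C)     # arena allocation, freed at end of block
            X .= rand()
            sum(X)
        end
    end
    return Σ
end

raw_sum(10, 10, 10); bump_sum(10, 10, 10)  # warm up / compile
t_raw  = @timed raw_sum(100, 100, 10_000)
t_bump = @timed bump_sum(100, 100, 10_000)
(t_raw.gctime, t_bump.gctime)              # raw pays GC time; bump ≈ 0
```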

A lot of structs are isbits; just check with `isbitstype`.
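For example (a quick sketch; the struct names are made up):

```julia
# isbits: immutable, with all fields themselves isbits.
struct Point
    x::Float64
    y::Float64
end

# A String (or any heap-allocated) field breaks the property.
struct Labeled
    name::String
    v::Float64
end

isbitstype(Point)            # true  — fine for Bumper's @alloc
isbitstype(Labeled)          # false — heap-referencing field
isbitstype(Vector{Float64})  # false — arrays are heap objects
```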


Learning something new every day.