How to preallocate and reuse buffers for oft-repeated computation?

In How to optimise Julia code: A practical guide, Jakob Nissen suggests that one way to optimize code + reduce allocations is to

  • Preallocate buffers once and reuse them multiple times

I’m wondering if someone has a walkthrough of an actual example comparing a “repeatedly allocate” solution to a “preallocate buffers and reuse” solution. I think I understand the idea, but I’m wondering how to follow the advice in practice.

For example, I have a calculation like the following:

struct ResultData
    value::Float64
    result_type::Bool
end

function calculate_and_use(x, y, z)
    if y > z
        temp = ResultData(y + z, true)
    else
        temp = ResultData(y * z, false)
    end
    if temp.result_type
        x + z
    else
        x - z
    end
end

Please ignore the implementation details—calculate_and_use stands in for any involved calculation that allocates an object, here played by ResultData, which carries some overhead.

Let’s say calculate_and_use has to be called millions of times by some function, so each time a ResultData object is allocated, then thrown away and, voilà, re-allocated a split second later. (Not that it seems to matter, but ResultData could actually be a fixed-length Vector{ResultData} or similar; in any case, it’s created, used, and then thrown away.)

What would be the analogue of pre-allocating a buffer in this context? Would I have some global buf = ResultData(0.0, false) that just gets overwritten? (But that violates “Use immutable structs over mutable structs when mutation is not needed,” as well as the dictum not to use globals.)

If not, then what is the right way to follow the preallocation advice in this context?

Or have I misunderstood the preallocation advice…and it is only applicable in other contexts?

Any suggestions appreciated; thank you!

struct objects are not generally allocated on the heap, so you don’t need to worry about them. See the manual section on tracking allocations:

[…] heap allocations […] are typically needed for either mutable objects or for creating/growing variable-sized containers (such as Array or Dict, strings, or “type-unstable” objects whose type is only known at runtime). Allocating (or deallocating) such blocks of memory may require an expensive system call (e.g. via malloc in C), and they must be tracked for garbage collection. In contrast, immutable values like numbers (except bignums), tuples, and immutable structs can be stored much more cheaply, e.g. in stack or CPU-register memory, so one doesn’t typically worry about the performance cost of “allocating” them.
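To see this concretely, here is a quick check with Base’s @allocated macro on the struct from the question (a sketch; the first call warms up the compiler so compilation isn’t counted):

```julia
struct ResultData
    value::Float64
    result_type::Bool
end

make_result(y, z) = y > z ? ResultData(y + z, true) : ResultData(y * z, false)

make_result(1.0, 2.0)             # warm up: exclude compilation from the count
@allocated make_result(1.0, 2.0)  # 0 bytes: the immutable struct never hits the heap
```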


Fair enough and thanks; that’s interesting!

Alas I was being lazy and the reality is that several of the many, many objects I’m repeatedly allocating are Dicts and others are Arrays.

I apologize—I think I oversimplified in my effort to keep the MWE minimal enough to be legible, but I think the question still stands notwithstanding that helpful background on struct allocation.

Hard to say more without a specific example, but the generic advice is to (a) avoid allocating temporary arrays (or other mutable data structures) at all in your inner loops (work in place, use tuples and StaticArrays, ditch the “vectorize everything” habits from Matlab and NumPy, …), and (b) otherwise write in-place functions that act on a reusable buffer array (passed as an argument, not as a global).
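For instance, a minimal sketch of that last pattern (hypothetical step/step! functions, not from the original question):

```julia
# Allocating version: a fresh temporary array on every call
step(v) = sum(v .* 2)

# In-place version: writes into a caller-owned buffer instead
function step!(buf, v)
    buf .= v .* 2        # broadcast assignment reuses buf, no allocation
    return sum(buf)
end

function run_steps(v, n)
    buf = similar(v)     # allocate the buffer once, outside the hot loop
    total = 0.0
    for _ in 1:n
        total += step!(buf, v)   # the same buffer is reused every iteration
    end
    return total
end
```

Here run_steps allocates buf once instead of n times; the same idea scales to passing buf in from yet another level up.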


This is great advice.

I think my question could alternatively be phrased as “what does working in place mean?” Is there an example / blog post / similar somewhere that shows the “in-place” version side-by-side with the “not-in-place” version?

In the example above, if we pretend temp is a mutable data structure / temporary array that’s not supposed to be used at all in an inner loop, what would the “in-place” alternative to temp be?

I definitely believe that doing things in-place would be better, but I’m trying to understand what that actually means concretely: swap values in a global object that holds the values, or create the placeholder outside the function and pass it as an argument, or…?

Please let me know if that makes sense.

There’s actually a Wikipedia page on the topic:

Keep in mind that only hot loops need to be optimized. Keep a profiler handy.

@stevengj @nsajko thanks both for the links. I do understand how to mutate an object passed as an argument.

I’m trying to ask about the paradigm. Is the ideal paradigm to generate the object to serve as the placeholder/buffer at the global level? One function-call up? Different advice for different situations? etc.

I figured that if there were concrete examples of this being done for certain functions, it would help inspire a solution. So: not a blog post saying “here’s how to use mutable versus immutable structs,” but a blog post saying, e.g., “I optimized an implementation of { the Black-Scholes algorithm / the Newton-Raphson method / quicksort } to reduce allocations; here’s what the code looks like in the original version with allocations versus the revised version without them.”

Thanks and apologies for the unclear question.

Just a silly example of converting an algorithm to an in-place one would be

function add_one(arr)
    return arr .+ 1
end

# In Julia, by convention we append `!` to the names of in-place/mutating functions,
# and such functions conventionally return the mutated argument
function add_one!(arr)
    for i in eachindex(arr)
        arr[i] += 1
    end
    return arr
end
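If it helps, you can confirm the difference with @allocated (a sketch; the warm-up call keeps compilation out of the measurement, and count_allocs is a made-up helper name):

```julia
add_one(arr) = arr .+ 1       # allocating: builds a brand-new array

function add_one!(arr)        # in-place: overwrites the existing array
    for i in eachindex(arr)
        arr[i] += 1
    end
    return arr
end

function count_allocs(f, arr)
    f(arr)                    # warm up so compilation isn't measured
    return @allocated f(arr)
end

arr = zeros(1000)
count_allocs(add_one, arr)    # ≈ 8 KiB: one fresh Vector{Float64} per call
count_allocs(add_one!, arr)   # 0 bytes
```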

Thanks all.

May I ask that we put this question on hold for a second; I’m going to provide a bigger MWE.

I’m clearly not explaining the question well. Stay tuned. 🙂

Almost certainly not global. It just needs to be somewhere outside of your critical inner loop(s).

You can find some examples on this thread, with the most common patterns to avoid, and what to do: Common allocation mistakes


Here is an artificial example of “how to optimize away allocations using in-place operations and buffers”:

A mental model I like to apply when optimizing allocations is to think about how one would write the same function in a language with manual memory management. Obviously, this only works if you know such a language well. I use C here, and IMO everyone who programs in Julia should have touched C at least once.

Consider

function compute_something(A::AbstractMatrix, vs::Vector{<:AbstractVector})
   result = zero(eltype(A))
   for v in vs
      x = A * v # this creates a temporary vector in every loop iteration
      result += sum(x)
   end
   return result
end

“Mental transpilation to C” would give something like

double compute_something(double *A, double **vs, int An, int Am, int nvs) {
   double result = 0.0;
   for (int i = 0; i < nvs; i++) {
       double *vi = vs[i];
       double *x = malloc(sizeof(double) * An);
       // matrix-vector multiplication: x = A * vi (A stored row-major)
       for (int r = 0; r < An; r++) {
          x[r] = 0.0;
          for (int c = 0; c < Am; c++) {
             x[r] += A[r * Am + c] * vi[c];
          }
       }
       for (int nn = 0; nn < An; nn++) {
          result += x[nn];
       }
       free(x); x = NULL;
   }
   return result;
}

where I left out any NULL checks for clarity.

If you know C then you realize that the above is not idiomatic, because you malloc and free a vector nvs times. Doing so in C might not be a bottleneck for your algorithm, but you still risk a memory leak.
In Julia you don’t risk the memory leak, but all of that malloc and free work then has to be done by the GC, and having the GC do the free is in general more expensive than in C (because it first needs to check that no references to the object remain).

So what you want to do here is move the malloc and free outside the loop.
Doing this in Julia is as easy as rewriting the above as

using LinearAlgebra 

function compute_something(A::AbstractMatrix, vs::Vector{<:AbstractVector})
   result = zero(eltype(A))
   x = zeros(eltype(A), size(A, 1)) # make a buffer once
   for v in vs
      mul!(x, A, v) # inplace matrix multiplication from LinearAlgebra
      result += sum(x)
   end
   return result
end

And if for some reason even that single allocation of the temporary x becomes too expensive relative to the matrix multiplications in the loop, then you might want to allocate x before you call compute_something and just pass it in as a buffer, e.g.

function compute_something_w_buffer(A::AbstractMatrix, vs::Vector{<:AbstractVector}, x::AbstractVector)
   result = zero(eltype(A))
   for v in vs
      mul!(x, A, v)
      result += sum(x)
   end
   return result
end
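Usage of the buffered version might then look like this (a sketch with made-up sizes; the function body is restated without type annotations for brevity):

```julia
using LinearAlgebra

function compute_something_w_buffer(A, vs, x)
    result = zero(eltype(A))
    for v in vs
        mul!(x, A, v)    # in-place A * v, written into x
        result += sum(x)
    end
    return result
end

A  = rand(100, 50)
vs = [rand(50) for _ in 1:1_000]
x  = zeros(100)                       # the caller allocates the buffer once
compute_something_w_buffer(A, vs, x)  # x is reused for all 1000 products
```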

Doing the above takes practice, in particular when it comes to recognizing which operations allocate, which don’t, and which can or cannot be made allocation-free.
But there are a ton of tools out there that can help you figure that out, like BenchmarkTools.jl, ProfileView.jl, @code_warntype, Cthulhu.jl, etc.

Lastly, the first things I like to check when optimizing allocs are (which is just a condensed version of the performance tips from the Julia docs):

  • Do I perform any array operations like =, *, +, -, / without a . (broadcasting)?
  • Do I perform any array slicing like x[1:4, :] without a @views? (IIRC optimizing this away only really works if you don’t pass your view to another function call that is not @inlined.)
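To illustrate those two checklist items with hypothetical functions:

```julia
function f_alloc(y, x)
    y = y + x                # no dot: allocates a brand-new array for y + x
    return sum(y[1:2])       # plain slice: copies the two elements first
end

function f_inplace!(y, x)
    y .= y .+ x              # dotted ops fuse into one in-place broadcast
    return sum(@view y[1:2]) # @view reads the slice without copying
end
```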

Usually not fixed by using in-place algorithms and buffers:

  • Are there any type instabilities?
  • Do I use any non-const globals?

I know that result = zero(eltype(A)) is not optimal, because the vs might have different eltypes too, but this was not the focus of this example.


This is excellent. I have much to digest and am doing so with excitement.

Thank you! This is exactly the sort of before-after comparison I was looking for.

I am going to test out some alternatives and see what happens. I don’t have a good sense of whether my analogue of your compute_something v2 (the one with x = zeros) would end up much slower than v3 (the one with x as an argument), mapped onto my problem.

I will test and report back. Thank you!

In the case I was trying to optimize, lots of profiling + editing + re-profiling etc. ended up revealing the creation of small arrays (mistake #3) as the binding issue. I switched over to StaticArrays wherever possible and achieved a significant speedup.
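As a sketch of that fix (using a plain tuple here so the snippet needs no packages; an SVector from StaticArrays.jl behaves the same way but additionally supports arithmetic):

```julia
# Heap version: every call builds a fresh 3-element Vector
cross_alloc(a, b) = [a[2]*b[3] - a[3]*b[2],
                     a[3]*b[1] - a[1]*b[3],
                     a[1]*b[2] - a[2]*b[1]]

# Stack version: a fixed-size tuple, no heap allocation at all
cross_tuple(a, b) = (a[2]*b[3] - a[3]*b[2],
                     a[3]*b[1] - a[1]*b[3],
                     a[1]*b[2] - a[2]*b[1])
```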
