How much of a difference does avoiding allocations make?

I was wondering how much benefit there is in avoiding allocations in different scenarios. One case that came to mind was taking the median of many equally sized arrays. I was doing this with an array of arrays:

medians = [median(arr) for arr in arrays]

The median function sorts the array: it first copies the original, allocating memory of the same size as the array, and then works on that copy, while the median! version modifies its argument in place. I wondered how much benefit you get from allocating only one “median buffer” and reusing that memory, like this:

# some test data
arrays = [rand(10000) for i in 1:10000];

using Statistics  # provides median and median!

function medians_reuse_memory(arrs)
    medians = Vector{Float64}(undef, length(arrs))

    # the only per-array allocation is this first copy
    allocated = copy(arrs[1])

    for i in eachindex(arrs)
        if i > 1
            # reuse the buffer instead of allocating a new copy
            allocated .= arrs[i]
        end
        # median! sorts the buffer in place, so no further allocation
        medians[i] = median!(allocated)
    end

    medians
end

function medians_alloc_memory(arrs)
    medians = Vector{Float64}(undef, length(arrs))

    for i in eachindex(arrs)
        # median copies arrs[i] internally on every iteration
        medians[i] = median(arrs[i])
    end

    medians
end
function medians_alloc_memory(arrs)
    medians = Vector{Float64}(undef, length(arrs))

    for i in eachindex(arrs)
        medians[i] = median(arrs[i]) 
    end

    medians
end

The benchmarks, run with BenchmarkTools.jl, looked like this:

julia> @benchmark medians_reuse_memory(arrays)
BenchmarkTools.Trial: 
  memory estimate:  156.41 KiB
  allocs estimate:  4
  --------------
  minimum time:     1.692 s (0.00% GC)
  median time:      1.850 s (0.00% GC)
  mean time:        1.835 s (0.00% GC)
  maximum time:     1.963 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

julia> @benchmark medians_alloc_memory(arrays)
BenchmarkTools.Trial: 
  memory estimate:  763.78 MiB
  allocs estimate:  20002
  --------------
  minimum time:     1.923 s (2.02% GC)
  median time:      1.963 s (2.18% GC)
  mean time:        1.960 s (2.16% GC)
  maximum time:     1.994 s (2.28% GC)
  --------------
  samples:          3
  evals/sample:     1

So you can see that the amount of memory allocated in version 2 is huge, but the time spent in the function is not drastically different from version 1. Is this because allocating memory is so fast, or is the garbage collection mechanism intelligent enough to reuse space that has just been freed and happens to have the correct size for a new allocation?

Is coding like the first function advisable, or will the gains be insignificant, thanks to the compiler, and not worth the more complicated code structure?

1 Like

Memory allocation is expensive compared to things like adding scalars, so you don’t want it in tight inner loops. It is not expensive compared to sorting 10000 numbers, so it doesn’t make as much difference to preallocate if the operation on the allocated data is expensive.
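To get a feel for the scales involved, here is a rough sketch with BenchmarkTools.jl (exact timings are machine-dependent):

using BenchmarkTools

v = rand(10_000)

@btime similar($v);  # a bare 10_000-element allocation
@btime copy($v);     # allocation plus a memcpy
@btime sort($v);     # allocation plus the O(n log n) sort, which dominates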

7 Likes

@stevengj could you please clarify your answer? Does it imply that GiB of data allocated in a call that is not in a tight inner loop does not really matter?

I’ve recently submitted a PR to NearestNeighbors.jl precisely because of that issue: WIP: Add non-allocating in-place knn! by juliohm · Pull Request #84 · KristofferC/NearestNeighbors.jl · GitHub. Seeing that much memory being allocated feels strange, to say the least. Don’t we gain anything by saving memory allocations?

1 Like

If you want to thread the code, allocations become more of a problem in my experience. All threads share the same memory and will compete with each other to allocate it. This may be partly because the innermost cache keeps getting flushed for all threads.

3 Likes

Well, it depends on where the memory comes from. Sometimes, the computer runs out and needs to grow more from its eggs, and high-speed bits can take several microseconds to hatch. Even more for floating-point bits because of the pesticides needed to suppress roundoff bugs. On the plus side, however, the leftover shells can be quite useful if you have a terminal handy.

8 Likes

That makes sense. I probably gained a false sense of “you have to avoid allocations” because that’s usually one of the first things people point out about others’ slow code in this forum.

As pointed out above, this is mostly relevant for inner loops.

That said, there is a healthy culture of micro-optimization one-upmanship on this forum, which occasionally runs amok, so it is easy to get the impression that there is a standard “bag of tricks” that is absolutely necessary to write fast Julia code.

I prefer to think of it the opposite way: idiomatic Julia code is reasonably fast out of the box, provided some pitfalls are avoided. 90% of the time it is not worth optimizing further. The remaining 10% should be identified by profiling and benchmarking.

6 Likes

It’s not that you shouldn’t allocate, it’s that you should allocate smartly. If your operations are O(n) on small arrays, then allocation may take just as long as or longer than the actual operations, so you get a good speedup by eliminating it. This is the standard “don’t allocate” case.
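For example, a minimal sketch of the small-array case (the timing ratio is machine-dependent):

using BenchmarkTools

x = rand(8); y = rand(8)
out = similar(x)

@btime $x .+ $y;          # allocates an 8-element result on every call
@btime $out .= $x .+ $y;  # writes in place: no allocation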

However, you can extrapolate that too far. For example, if you are doing a matmul on larger arrays, the O(n^3) operation so completely dominates the runtime that in many cases allocating makes no measurable difference. In fact, there are cases where allocating decreases runtime, because the cost of the operation far exceeds the cost of the allocation. For example, if A is a sparse matrix, A' as a lazy transpose is non-allocating, but operations on A' can be slow because the underlying CSC storage is optimized for column-wise access. In many cases it may reduce runtime to materialize the transposed sparse matrix once and then use it in a few operations.
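A sketch of that sparse-transpose case (colsums is a hypothetical helper, not from the post above):

using SparseArrays

A = sprand(5_000, 5_000, 0.001)

# Sum every column of B. With the lazy B = A', each column access is a
# row access on A's CSC storage, which has to scan all columns; with the
# materialized B = copy(A'), each column is stored contiguously.
colsums(B) = [sum(B[:, j]) for j in 1:size(B, 2)]

colsums(A')        # non-allocating transpose, awkward access pattern
colsums(copy(A'))  # pays one O(nnz) allocation, then fast column access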

4 Likes

But do we all agree that not allocating memory is better than allocating it? Not necessarily for performance, but for saving energy, avoiding issues with multiple threads, etc. The only positive thing about allocating that comes to mind is that the code may look clearer.

No. A single cache that multiple threads write into is not thread-safe, so allocating a little is a great way to avoid thread-safety issues, as long as your computation is still compute-dominated. And why would this necessarily “save energy”? I gave an example where it clearly saves energy to allocate the transpose, and you can find many examples where allocating greatly decreases the amount of compute needed. It’s a major oversimplification to make a blanket statement that not allocating memory is better than allocating it.
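In the running medians example, the trivially thread-safe version is simply the allocating one (a sketch, assuming Julia was started with multiple threads):

using Statistics, Base.Threads

function medians_threaded_alloc(arrs)
    medians = Vector{Float64}(undef, length(arrs))
    @threads for i in eachindex(arrs)
        # each iteration allocates its own scratch copy inside median,
        # so no buffer is shared between threads
        medians[i] = median(arrs[i])
    end
    medians
end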

3 Likes

Thank you Chris, I will reread the discussion above, because I find it counterintuitive that allocating can be better than not allocating. When I mentioned saving energy, I had in mind that any extra operation, like reserving space in memory, requires more work, and thus more energy. You are saying that this is not always true.

It depends on how you’re using those values. If you often use a set of values that is not contiguous in the original layout, such as a row of a matrix, you can save a lot of compute time by copying them into a vector so that they become contiguous. It depends on the complexity of what you’re doing with those values.
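For instance, with Julia’s column-major matrices (a minimal sketch):

M = rand(1000, 1000)

sum(@view M[1, :])  # no allocation, but strided access across columns
row = M[1, :]       # allocates a contiguous copy of the row
sum(row)            # contiguous access; wins if the row is reused many times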

1 Like

Well yes, but this little bit should preferably not be in the loop: better once per thread before the loop than once per loop iteration.

Yes, this seems to be good common practice: allocate work buffers before starting the loop and allocate as little as possible inside it. This becomes even more important for heavily threaded applications since, AFAIK, the Julia GC is still single-threaded, and Amdahl’s law kills you very quickly in that situation.

1 Like

Yes, allocate outside the loop, but still allocate once per thread in that case. You allocate more than the serial computation would, but it can be beneficial if the computation is long enough.
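A sketch of that pattern for the medians example (assuming all arrays have equal length; medians_threaded_reuse is a hypothetical name):

using Statistics, Base.Threads

function medians_threaded_reuse(arrs)
    medians = Vector{Float64}(undef, length(arrs))
    # one scratch buffer per thread, allocated before the loop
    buffers = [similar(arrs[1]) for _ in 1:nthreads()]
    @threads for i in eachindex(arrs)
        # on newer Julia versions, where tasks can migrate between
        # threads, prefer per-task buffers or a Channel of buffers
        buf = buffers[threadid()]
        copyto!(buf, arrs[i])
        medians[i] = median!(buf)
    end
    medians
end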

You guys have no idea how much I learned today reading this thread!

The argument of @ChrisRackauckas is very important. I faced something like this and only realized now what happened. I had a model (I think it was a geomagnetic field model, but I can’t really remember) where reducing the allocations by 3x increased the runtime by 10% :sweat_smile:

In the end, I agree with @Tamas_Papp. There is no magical formula for ultimate speed. Use the general tips (type stability, etc.) and you are generally fine, reaching 90% of the best performance possible. The other 10% is likely very dependent on your code.

3 Likes