How much of a difference does avoiding allocations make?

jules · May 24, 2019, 9:12pm

I was wondering how much use it is to try and prohibit allocations in different scenarios. One case that came to mind was taking the median of many equally sized arrays. I was doing this with an array of arrays:

medians = [median(arr) for arr in arrays]

The median function has to (partially) sort the array, so it first copies the original, which allocates memory of the same size as the array, and then modifies that copy; the ! version instead modifies the original in place. I wondered how much benefit you get from allocating only one “median array” and reusing that memory, like this:

using Statistics       # median, median!
using BenchmarkTools   # @benchmark

# some test data
arrays = [rand(10000) for i in 1:10000];

function medians_reuse_memory(arrs)
    medians = Vector{Float64}(undef, length(arrs))

    # only allocation is the first copy
    allocated = copy(arrs[1])

    for i in eachindex(arrs)
        if i > 1
            # reuse memory
            allocated[:] = arrs[i]
        end
        # modify in-place
        medians[i] = median!(allocated)
    end

    medians
end

function medians_alloc_memory(arrs)
    medians = Vector{Float64}(undef, length(arrs))

    for i in eachindex(arrs)
        medians[i] = median(arrs[i]) 
    end

    medians
end

The benchmarks were like this:

julia> @benchmark medians_reuse_memory(arrays)
BenchmarkTools.Trial: 
  memory estimate:  156.41 KiB
  allocs estimate:  4
  --------------
  minimum time:     1.692 s (0.00% GC)
  median time:      1.850 s (0.00% GC)
  mean time:        1.835 s (0.00% GC)
  maximum time:     1.963 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

julia> @benchmark medians_alloc_memory(arrays)
BenchmarkTools.Trial: 
  memory estimate:  763.78 MiB
  allocs estimate:  20002
  --------------
  minimum time:     1.923 s (2.02% GC)
  median time:      1.963 s (2.18% GC)
  mean time:        1.960 s (2.16% GC)
  maximum time:     1.994 s (2.28% GC)
  --------------
  samples:          3
  evals/sample:     1

So you can see that the amount of memory allocated in version 2 is huge, but the time spent in the function is not drastically different from version 1. Is this because allocating memory is so fast, or is the garbage collection mechanism intelligent enough to reuse space that has just been used and that happens to have the correct size for a new allocation?

Is coding like in the first function advisable or will there not be any significant gains because of the compiler and given that the structure of the code is more complicated?

stevengj · May 24, 2019, 9:39pm

Memory allocation is expensive compared to things like adding scalars, so you don’t want it in tight inner loops. It is not expensive compared to sorting 10000 numbers, so it doesn’t make as much difference to preallocate if the operation on the allocated data is expensive.
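To make the "tight inner loop" case concrete, here is a minimal sketch in which the work per iteration is only a few scalar operations, so a fresh array per iteration is a large share of the cost. The function names `sum_alloc` and `sum_prealloc` are made up for illustration and do not come from the thread.

```julia
# Hypothetical helpers for illustration only.

# Builds fresh 3-element arrays on every iteration.
function sum_alloc(n)
    total = 0.0
    for i in 1:n
        v = [i, 2i, 3i] .* 0.5      # new arrays each time through the loop
        total += sum(v)
    end
    return total
end

# Reuses one preallocated buffer for the whole loop.
function sum_prealloc(n)
    buf = Vector{Float64}(undef, 3)
    total = 0.0
    for i in 1:n
        buf[1] = 0.5i
        buf[2] = 1.0i
        buf[3] = 1.5i
        total += sum(buf)
    end
    return total
end

# Call each function once first so compilation is not counted.
sum_alloc(10)
sum_prealloc(10)

bytes_alloc = @allocated sum_alloc(100_000)
bytes_prealloc = @allocated sum_prealloc(100_000)
println("allocating: ", bytes_alloc, " bytes, preallocated: ", bytes_prealloc, " bytes")
```

Both functions return the same value; only the amount of allocated memory differs. Timing them with `@benchmark` should show whether the difference matters on a given machine and Julia version.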

juliohm · May 25, 2019, 12:50am

@stevengj could you please clarify your answer? Does it imply that GiB of data being allocated in a call that is not in a tight inner loop does not really matter?

I’ve recently submitted a PR to NearestNeighbors.jl precisely because of that issue: “WIP: Add non-allocating in-place knn!” by juliohm (Pull Request #84, KristofferC/NearestNeighbors.jl on GitHub). Seeing that much memory being allocated feels strange, at least. Don’t we gain anything by saving memory allocations?

baggepinnen · May 25, 2019, 3:24am

If you want to thread the code, allocations become more of a problem in my experience. All threads share the same memory and will be competing with each other to allocate memory. This will become more of a problem, possibly since the inner cache will be flushed for all threads all the time.

stevengj · May 25, 2019, 4:03am

Well, it depends on where the memory comes from. Sometimes, the computer runs out and needs to grow more from its eggs, and high-speed bits can take several microseconds to hatch. Even more for floating-point bits because of the pesticides needed to suppress roundoff bugs. On the plus side, however, the leftover shells can be quite useful if you have a terminal handy.

jules · May 25, 2019, 6:01am

That makes sense. I probably gained a false sense of “you have to avoid allocations” because that’s usually one of the first things people point out about others’ slow code in this forum.

Tamas_Papp · May 25, 2019, 6:21am

As pointed out above, this is mostly relevant for inner loops.

That said, there is a healthy culture of micro-optimization one-upmanship on this forum, which occasionally runs amok, so it is easy to get the impression that there is a standard “bag of tricks” that is absolutely necessary to write fast Julia code.

I prefer to think of this the opposite way: idiomatic Julia code is reasonably fast out of the box, provided some pitfalls are avoided. 90% of the time it is not worth optimizing it further. The 10% should be identified by profiling and benchmarking.
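As a rough sketch of what "benchmarking" can look like before any optimization, the helper below (a hypothetical `measure` function, not part of any package) runs a function once to compile it and then reports time and allocated bytes for a later run, using only `@elapsed` and `@allocated` from Base. The `@benchmark` macro from BenchmarkTools used earlier in the thread is the more robust tool, since it takes many samples.

```julia
using Statistics

# Hypothetical helper: warm up once, then measure separate runs.
function measure(f, args...)
    f(args...)                          # warm-up so compilation is excluded
    bytes = @allocated f(args...)       # bytes allocated by one call
    seconds = @elapsed f(args...)       # wall-clock time of another call
    return (seconds = seconds, bytes = bytes)
end

data = [rand(1_000) for _ in 1:100]
stats = measure(d -> map(median, d), data)
println("time: ", stats.seconds, " s, allocated: ", stats.bytes, " bytes")
```

If a function measured this way is not among the hot spots, its allocations are probably not worth restructuring the code for.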

ChrisRackauckas · May 25, 2019, 10:38am

It’s not that you shouldn’t allocate, it’s that you should allocate smartly. If your operations are O(n) on small arrays, then allocations may take just as long or longer than the actual operations, and so you get a good speedup by eliminating them. This is the standard case of “don’t allocate”.

However, you can extrapolate that too far. For example, if you are doing a matmul on larger arrays, the O(n^3) operation so completely dominates the runtime that in many cases allocating doesn’t make a difference to the runtime. And in fact there are cases where allocating decreases runtime, because the cost of the operation is so much more than the cost of the allocation. For example, if A is a sparse matrix, A' as a lazy matrix is non-allocating, but doing operations on A' can be slow because its memory is directly optimized for column-wise operations. In many cases, it may reduce runtime to allocate the sparse matrix for the transpose, and then use that in a few operations.
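A small sketch of the sparse transpose example, assuming the standard-library `SparseArrays` package. `A'` is a lazy wrapper, while `copy(A')` allocates a new `SparseMatrixCSC` whose columns are the rows of `A`. The helper `row_sum` is hypothetical and only illustrates a row-wise operation that becomes a cheap column-wise read after materializing the transpose; whether this pays off depends on how often the transpose is reused.

```julia
using SparseArrays, LinearAlgebra

A = sprand(1_000, 1_000, 0.01)

# Lazy adjoint: no new sparse storage, just a wrapper around A.
Alazy = A'

# Materialized transpose: allocates a new matrix in compressed-column form.
At = copy(Alazy)

# Hypothetical helper: sum of row i of A, read as column i of At.
# nzrange gives the positions of column i's stored entries in nonzeros(At).
row_sum(At::SparseMatrixCSC, i) = sum(@view nonzeros(At)[nzrange(At, i)])

sums_via_transpose = [row_sum(At, i) for i in 1:size(A, 1)]
sums_direct = vec(sum(A, dims = 2))
```

The one-time allocation of `At` is amortized over all row reads, which is the sense in which allocating can reduce total runtime.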

juliohm · May 25, 2019, 11:32am

But do we all agree that not allocating memory is better than allocating it? Not necessarily for performance but for saving energy, avoiding issues with multiple threads, etc. The only positive thing about allocating that comes to my mind is that code may look clearer.

ChrisRackauckas · May 25, 2019, 11:43am

No. Having a single cache to write into is something that’s not thread-safe, so allocating a little bit is a great way to avoid thread-safety issues as long as your computation is still compute-dominated. And why would this necessarily be “saving energy”? I gave an example where it clearly saves energy to allocate the transpose, and you can find a lot of examples where allocating will greatly decrease the amount of compute needed. It’s a major oversimplification to make a blanket statement that not allocating memory is better than allocating it.

juliohm · May 25, 2019, 11:53am

Thank you, Chris. I will try to reread the discussion above because I find it counterintuitive that allocating can be better than not allocating. When I mentioned saving energy, I had in mind that any extra operation, like reserving space in memory, requires more work and therefore more energy. You are saying that this is not always true.

ChrisRackauckas · May 25, 2019, 12:01pm

It depends on how you’re using those values. If you’re often using some values, like a row, which are not contiguous in the original form, then you can save a lot of compute time by creating a vector so that way those values are contiguous. It depends on the complexity of what you’re doing to these sets of values.
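A minimal sketch of the row example for a dense matrix. Julia arrays are column-major, so the elements of a row are `size(A, 1)` entries apart in memory; copying the row once gives a contiguous vector for repeated use. The sizes and the dot-product workload are arbitrary choices for illustration.

```julia
using LinearAlgebra

A = rand(500, 500)
i = 7

row_view = @view A[i, :]   # no copy; consecutive elements are size(A, 1) apart in memory
row_copy = A[i, :]         # allocates a contiguous Vector{Float64}

xs = [rand(500) for _ in 1:100]

# Same results either way; only the memory access pattern differs.
dots_view = [dot(row_view, x) for x in xs]
dots_copy = [dot(row_copy, x) for x in xs]
```

Whether the copy is faster overall depends on how many times the row is reused relative to the cost of the single allocation, so it is worth benchmarking both variants.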

baggepinnen · May 25, 2019, 12:14pm

Well yes, but this little bit should preferably not be in the loop. Rather once per thread before the loop than once per loop iteration.

louisponet · May 25, 2019, 1:06pm

Yes, this seems to be good common practice: allocate work buffers before starting the loop and allocate as little as possible inside. This becomes even more important for highly threaded applications since, AFAIK, the Julia GC is still single-threaded and Amdahl’s law kills you very quickly in that situation.

ChrisRackauckas · May 25, 2019, 1:09pm

Yes, allocate outside the loop, but still allocate once per thread in that case. You’re allocating more than the serial computation, but it can be beneficial in this case if the computation is long enough.
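Applied to the median example from the first post, "allocate once per thread, outside the loop" could look roughly like the sketch below. The function `medians_chunked` and its `ntasks` keyword are hypothetical names; the sketch assumes a non-empty collection of equally sized arrays and gives each spawned task its own scratch buffer, so no buffer is shared between tasks.

```julia
using Statistics
using Base.Threads: @spawn, nthreads

# Hypothetical helper: one scratch buffer per task, reused for every
# array in that task's chunk. Assumes arrs is non-empty and all arrays
# have the same length.
function medians_chunked(arrs; ntasks = nthreads())
    medians = Vector{Float64}(undef, length(arrs))
    chunksize = cld(length(arrs), ntasks)
    tasks = map(Iterators.partition(eachindex(arrs), chunksize)) do idxs
        @spawn begin
            buf = similar(arrs[first(idxs)])   # allocated once per task
            for i in idxs
                copyto!(buf, arrs[i])          # reuse the task-local buffer
                medians[i] = median!(buf)      # reorders buf; arrs[i] is untouched
            end
        end
    end
    foreach(wait, tasks)
    return medians
end

arrays = [rand(1_000) for _ in 1:200]
result = medians_chunked(arrays)
```

Each task writes to disjoint indices of `medians`, so the only shared state is read-only input. This allocates more than the serial version with a single buffer, but far less than one copy per array.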

Ronis_BR · May 25, 2019, 1:44pm

You guys have no idea how much I learned today reading this thread!

The argument of @ChrisRackauckas is very important. I faced something like this and only now realized what had happened. I had a model (I think it was a geomagnetic field model, but I can’t really remember) where reducing the allocations by 3x increased the runtime by 10%.

In the end, I agree with @Tamas_Papp. There is no magic formula for ultimate speed. Use the general tips (type stability, etc.) and you are generally fine, reaching 90% of the best performance possible. The other 10% is likely very dependent on your code.
