Common allocation mistakes

@amrods, splitting here the question you posted.

1. Create a mutable array inside a loop, explicitly:

julia> function f(x)
         z = 0.
         for i in 1:10
           y = [ i+1, i+2 ]
           z += x[1]*y[1] + x[2]*y[2]
         end
         z
       end
f (generic function with 1 method)

julia> @btime f(x) setup=(x=rand(2)) 
  242.694 ns (10 allocations: 960 bytes)
110.46200068979707

As you can see, there are 10 allocations, which come from the line y = [i+1,i+2]. This is something that possibly the compiler could figure out, and solve it, but currently it does not. That can be solved in several ways (for example not creating at all the y vector and multiplying directly by the elements in the next line, but that would be artificial in a general case). The most clear way to solve that is use static arrays:

julia> using StaticArrays

julia> function f(x)
         z = 0.
         for i in 1:10
           y = @SVector [ i+1, i+2 ]
           z += x[1]*y[1] + x[2]*y[2]
         end
         z
       end
f (generic function with 1 method)

julia> @btime f(x) setup=(x=rand(2)) 
  5.526 ns (0 allocations: 0 bytes)
26.367283900704752

To understand what happened here, we need to understand that there is more than way to create a vector in the computer. The previous one created the vector in the “heap”, which is one type of memory. There the vector can be mutable, change size, etc, and has a pointer. When using static arrays, the vector will have a constant size and won’t have an address. It is created then in one of the faster temporary memories (stack, or processor cache). It is possible (particularly in these simple examples) that the compiler just eliminates the need of the vector completely by simplifying the code. It is more likely to get these optimizations if the vector is static as well.

This is one case where it is possible that the Julia compiler will, at some point, optimize the first loop as well in the same way, by figuring out that the y vector is local to the loop and can be static.

2. Creating vectors unintentionally with broadcasting

Probably one of the most common mistakes that Julia allows people to do is to create intermediate vectors by broadcasting:

julia> using LinearAlgebra

julia> x = rand(2); y = rand(2);

julia> function f(x,y)
         z = 0.
         for i in 1:10
           z += norm(x .* y) 
         end
         z
       end
f (generic function with 2 methods)

julia> @btime f(x,y) setup=(x=rand(2);y=rand(2))
  333.809 ns (10 allocations: 960 bytes)
5.583855623549817

Again, there are those 10 allocations and a lousy perfomance. The problem here is that x .* y, which is the element-by-element multiplication of x and y creates a new array (of course in this simple case we can compute x .* y only once outside the loop). Now, one solution is to use static arrays again for x and y:

julia> @btime f(x,y) setup=(x=rand(SVector{2,Float64});y=rand(SVector{2,Float64}))
  3.388 ns (0 allocations: 0 bytes)
9.108217887882278

That is optimal, but depends on the input types, thus it is on the hands of the user, and may be not generalizable. The function could, however, guarantee that at least only one intermediary vector is allocated:

julia> function f(x,y)
         z = 0.
         tmp = zeros(2)
         for i in 1:10
           @. tmp = x * y
           z += norm(tmp) 
         end
         z
       end
f (generic function with 2 methods)

julia> @btime f(x,y) setup=(x=rand(2);y=rand(2)) 
  147.463 ns (1 allocation: 96 bytes)
4.352897649964319

Note that there is ony one allocation, and that the time is 2x faster than the first code (still much slower than the static array version, but again, these are artificial examples to illustrate the allocations, in a real case it could be impossible to use static arrays). Here, the @. tmp = x * y does the element-by-element product but updated vector tmp instead of creating a new vector at every iteration of the loop.

But note that this function still allocates. If this function was to be called from another piece of code that does a loop, that single allocation would become a lot of allocations and would be problematic. Then you would like to have this function also allocation-free, and then it would be good to completely preallocate tmp outside f:

julia> function f(x,y,tmp)
         z = 0.
         for i in 1:10
           @. tmp = x * y
           z += norm(tmp) 
         end
         z
       end
f (generic function with 3 methods)

julia> @btime f(x,y,tmp) setup=(x=rand(2);y=rand(2);tmp=zeros(2))
  125.657 ns (0 allocations: 0 bytes)
1.0594243698772656

3. Functions that return (small) arrays of undetermined size

One it come down to optimizing performance, one has to take care of using the static approach as much as possible, such that mutable arrays are not created. Again, forgive me by the artificiality of the examples. Suppose that one wants a function that returns a vector containing half of the elements of the input vector:

julia> f(x) = [ x[i] for i in 1:div(length(x),2) ]
f (generic function with 1 method)

This, of course, will allocate a new vector, the output vector. If that is called from within a loop, that will be very bad:

julia> function loop(x)
         z = 0.
         for i in 1:10
           z += sum(f(x))
         end
         z
       end
loop (generic function with 1 method)

julia> @btime loop(x) setup=(x=rand(4))
  302.872 ns (10 allocations: 960 bytes)
18.00302470686425

How to solve that? We need to give the compiler all the possible information. Thus 1) let us define x as a static array, with a fixed size. 2) let us define f in such a way that the output of vector length is known at compile time:

julia> f(x::SVector{N,T}) where {N,T} = SVector{div(N,2),T}(ntuple(i->x[i],div(N,2)))
f (generic function with 2 methods)

julia> @btime loop(x) setup=(x=rand(SVector{4,Float64}))
  2.099 ns (0 allocations: 0 bytes)
6.298705903007199

Now things got more complex: f receives a static vector with specified length N and element type T. It returns also a static vector, but which is created with size div(N,2) (this is the integer division, returning an integer), same type T and the elements of x from the first half of the vector, collected in a tuple of compile-time known size. This is also known at compile time, because Julia specializes f to the type of input variable, and here the type of input includes the length of x.

4. Slices in Julia allocate

This is another very common issue new users encounter. The above example could have been written in a more simple way as:

f(x) = x[1:div(length(x),2)]

Thus, we want the slice of the vector x with half of its elements. Slice, in Julia, allocate new arrays by default:

julia> @btime f(x) setup=(x=rand(4));
  29.445 ns (1 allocation: 96 bytes)

That is solved using views:

julia> f(x) = @view(x[1:div(length(x),2)])
f (generic function with 1 method)

julia> @btime f(x) setup=(x=rand(4))
  2.520 ns (0 allocations: 0 bytes)
2-element view(::Vector{Float64}, 1:2) with eltype Float64:
 0.7250861856178934
 0.4313586640733562

But, of course, this is different from the previous function, since in the first case, mutating an element of the output array will not mutate the original array, and in the second case it will.

If anyone has other/new/better examples, let us know :slight_smile:

(edit: fixed benchmark calls)

57 Likes

Thank you so much for taking the time to write this. I will study these examples.

1 Like

A slightly less heavy-weight solution for number 1 uses a Tuple rather than a SVector:

function f2(x)
    z = 0.
        for i in 1:10
            y = (i+1, i+2)
            z += x[1]*y[1] + x[2]*y[2]
        end
    return z
end

EDIT: As @lmiq points out, SArray provides more of an AbstractArray and linear algebra interface on top of Tuples. Which to choose, Tuple or SArray, depends in part on how much of these facilities you need.

14 Likes

Indeed, just noticing that many operations are not available for tuples, thus that depends exactly on what is done with the array afterwards.

2 Likes

Why the difference between these 2 functions? I thought I was safe from allocations by broadcasting.

using BenchmarkTools
using Random

function myrand1!(A)
    for i in eachindex(A)
        A[i] = rand()
    end
    A
end

function myrand2!(A)
    A .= rand(size(A)...)
    A
end

A = zeros(5, 5)

julia> @btime myrand1!($A);
  79.524 ns (0 allocations: 0 bytes)

julia> @btime myrand2!($A);
  94.310 ns (1 allocation: 288 bytes)

The rand(size(A)...) by itself allocates a matrix on the right-hand side of that. This is one case either you use a loop or, if you need that random matrix more than once, preallocate it.

I am not sure if there is a one-liner for that, except if A is a static matrix:

julia> A = @SMatrix(zeros(5,5));

julia> function myrand2!(A)
           A = rand(typeof(A))
           A
       end
myrand2! (generic function with 1 method)

julia> myrand2!(A)
5Ă—5 SMatrix{5, 5, Float64, 25} with indices SOneTo(5)Ă—SOneTo(5):
 0.173026  0.850559   0.831545  0.421062  0.705017
 0.496171  0.653757   0.110728  0.214264  0.294655
 0.314909  0.334194   0.561288  0.7288    0.316461
 0.766899  0.843623   0.725023  0.394503  0.187029
 0.113319  0.0721517  0.541748  0.992827  0.392566

julia> @btime myrand2!($A);
  36.690 ns (0 allocations: 0 bytes)

2 Likes

You can just broadcast the rand() call:

julia> v = zeros(3);

julia> v .= rand.()
3-element Vector{Float64}:
 0.3759061148529399
 0.16151859904091448
 0.09441097989235558

julia> @. v = rand()
3-element Vector{Float64}:
 0.5675808474761124
 0.5474349597286634
 0.07658147808190585
8 Likes

I see, so the allocation is not because of the A = ... assignment but in the creation of arrays. Is that then a good rule of thumb: “Avoid creating arrays, prefer to do things at the scalar level.”?

Inside critical loops, yes. In general not necessarily, of course having arrays to work with can be beneficial, even for performance, depending on the algorithm used. The repeated creation of mutable arrays is what will kill performance.

4 Likes

Yes, it’s almost always best to write functions for elementwise/scalar input, then use broadcasting (or other higher-order constructs like map or array comprehensions) to apply them to data. When writing a function that requires one or more temporary arrays, I’ll often write an in-place version f!(tmp, args...) that takes the temporaries as arguments, then write a second version f(args...; tmp=setup_tmp()) = f!(tmp, args...) that creates the temporaries if they’re not provided.

4 Likes

Just to remember that thanks to C. Elrod, it hard to beat an explicit loop with @turbo when that applies.

2 Likes

Just one more question (for now :slight_smile:). I remember reading somewhere in Discourse that there are 2 meanings for =: one is assignment as A[1] = 1, another one is “pointer” as in A = rand(2, 2); B = A. Do you remember reading the same thing? Is there a section in the manual about that?

Not the manual, but I have annotated that:

https://m3g.github.io/JuliaNotes.jl/stable/assignment/

3 Likes

For simple-ish cases, I’ve had pretty good luck with LoopVectorization.jl’s built-in vmap/vmap! and its threaded friends vmapt/vmapt!.

*edit: there’s a tentative long-term plan to allow these transformations with @turbo map(f, args...): https://github.com/JuliaSIMD/LoopVectorization.jl/issues/154#issuecomment-727009228

3 Likes

I’ve tried using LoopVectorization before and I found it “brittle”. I think I will focus for now on avoiding allocations.

Another thing is that I am afraid that StaticArrays does not work with ForwardDiff or ComponentArrays. I am especially worried about abandoning ComponentArrays. I program complex models with many parts and trying different functional forms etc, and ComponentArrays saves me a lot of manual tinkering with large vectors of parameters.

1 Like

Just one small example where creating a temporary array helps:

julia> using BenchmarkTools, LoopVectorization

julia> function sum_sin_odd(x)
         s = 0.
         for i in eachindex(x)
           if isodd(i)
             s += sin(x[i])
           end
         end
         return s
       end
sum_sin_odd (generic function with 1 method)

julia> function sum_sin_odd2(x)
         y = [ x[i] for i in 1:2:length(x) ]
         s = 0.
         @turbo for i in eachindex(y)
           s += sin(y[i])
         end
         return s
       end
sum_sin_odd2 (generic function with 1 method)

julia> @btime sum_sin_odd($x)
  3.453 ms (0 allocations: 0 bytes)
229765.60808923544

julia> @btime sum_sin_odd2($x)
  1.495 ms (2 allocations: 3.81 MiB)
229765.60808923293

2 Likes

Does LoopVectorization work on different processor architectures with similar performance? say on Intel, mac silicone, etc.

The short answer is it can. New architectures will require some tuning for good performance, however.

1 Like

Is there any tool that allows me to see where allocations are happening?

The Profile stdlib and its accompanying doc page is one way to profile allocations.

4 Likes