Common allocation mistakes

Yes, it’s almost always best to write functions for elementwise/scalar input, then use broadcasting (or other higher-order constructs like map or array comprehensions) to apply them to data. When writing a function that requires one or more temporary arrays, I’ll often write an in-place version f!(tmp, args...) that takes the temporaries as arguments, then write a second version f(args...; tmp=setup_tmp()) = f!(tmp, args...) that creates the temporaries if they’re not provided.
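A minimal sketch of that pattern (the names f!, f, and the squaring here are made up for illustration):

function f!(tmp, x)
    tmp .= x .^ 2        # use the caller-supplied temporary as scratch space
    return sum(tmp)
end

f(x; tmp=similar(x)) = f!(tmp, x)  # allocates the temporary only if none is given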

3 Likes

Just remember that, thanks to C. Elrod, it is hard to beat an explicit loop with @turbo when that applies.
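For reference, a minimal sketch of such a loop (assuming the body only uses operations @turbo supports, like sin here):

using LoopVectorization

function turbo_sum_sin(x)
    s = zero(eltype(x))
    @turbo for i in eachindex(x)
        s += sin(x[i])   # @turbo vectorizes the loop, including the sin calls
    end
    return s
end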

2 Likes

Just one more question (for now 🙂). I remember reading somewhere on Discourse that = has two meanings: one is assignment, as in A[1] = 1, and the other is binding (a “pointer”), as in A = rand(2, 2); B = A. Do you remember reading the same thing? Is there a section in the manual about that?

Not in the manual, but I have written about that here:

https://m3g.github.io/JuliaNotes.jl/stable/assignment/
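In short, a minimal sketch of the distinction:

A = rand(2, 2)   # `=` binds the name A to a new array
B = A            # binds B to the *same* array; nothing is copied
B[1, 1] = 0.0    # `=` on an indexed location mutates the shared array
A[1, 1] == 0.0   # true: A and B are the same object
C = copy(A)      # an explicit copy gives an independent array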

3 Likes

For simple-ish cases, I’ve had pretty good luck with LoopVectorization.jl’s built-in vmap/vmap! and its threaded friends vmapt/vmapt!.

*edit: there’s a tentative long-term plan to allow these transformations with @turbo map(f, args...): Boil down vmap! variants, extend to reduce & mapreduce · Issue #154 · JuliaSIMD/LoopVectorization.jl · GitHub
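A small sketch of how those are used (vmapt! assumes Julia was started with multiple threads):

using LoopVectorization

x = rand(10^6)
y = similar(x)

y2 = vmap(sin, x)   # allocating, SIMD-vectorized map
vmap!(sin, y, x)    # non-allocating: writes into the preallocated y
vmapt!(sin, y, x)   # threaded + SIMD variant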

3 Likes

I’ve tried using LoopVectorization before and I found it “brittle”. I think I will focus for now on avoiding allocations.

Another thing is that I am afraid that StaticArrays does not work with ForwardDiff or ComponentArrays. I am especially worried about abandoning ComponentArrays. I program complex models with many parts and try different functional forms, etc., and ComponentArrays saves me a lot of manual tinkering with large vectors of parameters.

1 Like

Just one small example where creating a temporary array helps (copying the odd-indexed elements into a contiguous array lets @turbo vectorize the sin loop):

julia> using BenchmarkTools, LoopVectorization

julia> x = rand(10^6);  # not shown in the original post; length inferred from the 3.81 MiB allocation below

julia> function sum_sin_odd(x)
         s = 0.0
         for i in eachindex(x)
           if isodd(i)
             s += sin(x[i])
           end
         end
         return s
       end
sum_sin_odd (generic function with 1 method)

julia> function sum_sin_odd2(x)
         y = [ x[i] for i in 1:2:length(x) ]
         s = 0.0
         @turbo for i in eachindex(y)
           s += sin(y[i])
         end
         return s
       end
sum_sin_odd2 (generic function with 1 method)

julia> @btime sum_sin_odd($x)
  3.453 ms (0 allocations: 0 bytes)
229765.60808923544

julia> @btime sum_sin_odd2($x)
  1.495 ms (2 allocations: 3.81 MiB)
229765.60808923293

1 Like

Does LoopVectorization work on different processor architectures with similar performance? Say, on Intel, Apple silicon, etc.?

The short answer is it can. New architectures will require some tuning for good performance, however.

1 Like

Is there any tool that allows me to see where allocations are happening?

The Profile stdlib (see its accompanying documentation page) is one way to profile allocations.
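For example, with the Profile.Allocs interface (available since Julia 1.8; the function g here is just a placeholder):

using Profile

g(n) = sum(abs2, rand(n))   # placeholder function to investigate

g(10)                       # run once so compilation isn't profiled
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 g(10_000)
results = Profile.Allocs.fetch()   # inspect, or visualize with e.g. PProf.jl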

4 Likes

https://m3g.github.io/JuliaNotes.jl/stable/memory/

2 Likes

Question: for functions that create arrays inside of them, is it better to pass those arrays as arguments instead? For example:

function f1(x)
    sto = similar(x)
    sto = x .+ 1   # note: `=` here rebinds sto to a new array (see the reply below)
    return sto
end

function f2(sto, x)
    sto = x .+ 1   # same rebinding issue here
    return sto
end

So that in a loop I can reuse sto without Julia having to allocate memory for every iteration:

function g1(x)
    for i in 1:100
        sto = f1(x)
    end
end

function g2(x)
    sto = similar(x)
    for i in 1:100
        f2(sto, x)
    end
    return sto
end

is that the point?

1 Like

Yes; usually what I do is leave this as an option for the user, like this:

function f1(x;aux=similar(x))
    aux .= x .+ 1 # note the broadcasting here!
    return aux
end

Then the function can be called with only x, in which case it allocates aux, or you can pass aux as a keyword parameter, in which case it reuses what is given.

@amrods note that you need .= to broadcast, otherwise the preallocated vector is not used for anything (this is incorrect in the code you posted there).
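For completeness, a corrected version of the earlier f2/g2 sketch, using in-place broadcasting (f2! is renamed here following the ! convention for mutating functions):

function f2!(sto, x)
    sto .= x .+ 1    # .= writes into the existing array instead of rebinding
    return sto
end

function g2(x)
    sto = similar(x)
    for _ in 1:100
        f2!(sto, x)  # sto is reused; no allocation inside the loop
    end
    return sto
end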

3 Likes

Oh nice, I didn't know you could reference other arguments in a function definition.

There is also Random.rand! which will fill the array with random values, and also make sure that they have the correct eltype. For example

using Random

A = rand(Int8, 3, 4);
rand!(A)

rand! will generate random Int8 values, etc.

3 Likes

These are great tips - thank you for writing this.

One thing I recently learned related to views (and if I get any of this wrong, please someone correct me) is that a view can still allocate. For example, if you have a hot inner loop where you use small blocks of a larger vector, you might think (or at least I naively did) that you can avoid heap allocations by using something like the following. I thought I could use the view to do allocation-free access and operations on elements of A.

A = rand(100)
B = rand(3)
function view_test(A, B)
    tmp = @view A[2:4]
    tmp .= tmp .* B
end

This is a silly example, but in my own hot inner loop I was reusing that view in many places and re-creating it each iteration of the loop. I ended up with tons of heap allocations everywhere that view was used (sorry if this is obvious to others!). I eventually found some posts indicating that for something like this you should just loop over the small block, which is what I ended up doing (see the sketch below). In other locations, with larger arrays, I got around the problem by using copyto! or just wrote loops there as well.
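For reference, a sketch of that loop-over-the-block alternative (made-up name, matching the view_test example above):

function block_test!(A, B)
    @inbounds for (j, i) in enumerate(2:4)
        A[i] *= B[j]   # operate on the block directly, without creating a view
    end
    return A
end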

That specifically should not allocate. But you can have allocations with views if the slice is of a non-contiguous memory block (I'm not sure that this is true; what certainly happens is that views of non-contiguous slices are not very performant).

I get this when using @btime: 20.160 ns (1 allocation: 48 bytes)

What I obtain:

julia> @btime view_test($A, $B);
  169.941 ns (0 allocations: 0 bytes)

Perhaps you didn’t interpolate the variables?

julia> @btime view_test(A, B);
  195.690 ns (1 allocation: 48 bytes)

I'm amazed at how fast your system is, btw.

1 Like