Prevent huge number of allocations mutating columns of arrays

Hi, I am writing some code to update the columns in an array by repeatedly solving a linear system. I am trying to allocate arrays for the solutions, left-hand-side, and right-hand-side at the beginning of the code and then mutate them to increase performance. Below is a minimal working example (for the use case n will be much larger)


using LinearAlgebra, Distributions, Random, LinearSolve, BenchmarkTools, InvertedIndices

mutable struct G_hat{M<:Matrix{Float64}} # creating a mutable struct with a guess matrix we will update
    mat::M
end

function G_hat(N::Int64) # outer constructor function for our guess
    G_hat(ones(N,N))
end

struct B_struct{A<:Matrix{Float64}} # immutable struct object for S
    mat::A
end

function make_B(N::Int64) # outer constructor function for B_mat
    B_struct(rand(N,N))
end

function init_lhs(n::Int64) # outer constructor function for lhs
    Vector{Float64}(undef, n-1)
end

f(a,b) = a .- b

function update_lhs!(M, B_j_sl, G_minus_j, B_j)
   
    A = similar(B_j_sl)
    mul!(A, G_minus_j', B_j) 
    M .= B_j_sl - A
    return nothing
end

function init_rhs(n::Int64) # outer constructor function for inner prod starting in place one!
    Matrix{Float64}(undef, (n-1, n-1))
end


function update_rhs!(M, K, B_sl)
    M .= f.(K, B_sl)
    return nothing
end

make_ols_estimates! = function(b_hat, G_minus_j, B_minus_j, B_j, K, j, lhs, rhs)

    update_lhs!(lhs, @view(B_j[Not(j),:]), G_minus_j,  B_j) # update lhs - lots of allocations for some reason
    update_rhs!(rhs, K, @view(B_minus_j[Not(j),:])) # update rhs - still lots of allocations!
    b_hat .= solve(LinearProblem(rhs, lhs)).u

    return nothing

end

sweep_column_two! = function(G_hat_1, B, K, j,p, b_hat, lhs, rhs)
    
    make_ols_estimates!(b_hat, @view(G_hat_1[:,Not(j)]), @view(B[:,Not(j)]), @view(B[:,j]),K, j, lhs, rhs)
    pos = quantile(vec(b_hat), p) 
    replace!(x -> x>pos ? zero(Float64) : x, b_hat)
    replace!(x -> xā‰¤pos ? 1.0 : x, b_hat)
    G_hat_1[Not(j),j] .= b_hat
    return nothing

end


sweep_matrix! = function(G_hat_1,B,K, p, n, lhs, rhs)
    for j in 1:n 
        b_hat = Vector{Float64}(undef, n-1) # allocate vector for results
        sweep_column_two!(G_hat_1, B, @view(K[Not(j),Not(j)]), j, p, b_hat, lhs, rhs)# update column 
        K[:,j] .= rand(n) # update K
    end
    return nothing
end


dummy_one_run = function()
    n = 100
    p = 0.3
    gamma = 0.9
    maxit = 1
    B = make_B(n).mat

    G_hat_1 = G_hat(n).mat
    K = rand(n,n)
    lhs = init_lhs(n)
    rhs = init_rhs(n)

    for it in 1:maxit 
        println(it)
        sweep_matrix!(G_hat_1, B,K, p, n, lhs, rhs)
    end
    return nothing
end

@btime dummy_one_run()

For some reason, I end up creating a huge number of allocations (392480) somewhere in the inner loops when trying to mutate. The allocations make the code run very slowly. Yet when @btime the individual lines within the update_lhs!, update_rhs! functions, they seem to perform very few allocations. I would really appreciate some help, as I cannot work out where the allocations are happeningā€¦

Itā€™s not exactly a Minimal working exampleā€¦

update_lhs! allocates A each time. This is where your allocations may come from.
Try to preallocate A and pass it into update_lhs!.

The next step is (always) to run a profiler. There is really no point trying to optimize code until you know where it actually spends most of its time.
Once you have those results, you can construct a more minimal example that will be easier to work on.

Finally, why the strange notation sweep_column_two! = function ... instead of function sweep_column_two!(args) ...?

1 Like

Thanks for your comment! Maybe it is easier to focus on the update_lhs! function. After implementing your suggestion, I get this function

function update_lhs!(M::Vector{Float64}, B_j_sl::Vector{Float64}, G_minus_j::Matrix{Float64}, B_j::Vector{Float64}, A::Vector{Float64})
   
    mul!(A, G_minus_j', B_j) 
    M .= B_j_sl - A
    return nothing
end

This function still makes seven allocations each time it is applied to try and mutate the vector. Similar holds for the update_rhs! function, which makes 8 allocations each time. I do not understand why, as my understanding was that .= should alter the object inplace and give me 0 allocations. I wondered whether it was due to an interaction/type-stability problem with some object further up (hence the less minimal ā€˜minimalā€™ example)ā€¦

(Thanks for spotting the strange notation - I have had to work in other languages recently due to non-Julia co-authors which I think is to blame here)

You need a dot for every operation, otherwise B_j_sl - A just makes another temproary.
So instead it should be

M .= B_j_sl .- A
# or
@. M = B_j_sl - A # @. applies . everywhere

Iā€™d really recommend profiling your code before trying to do any optimization otherwise youā€™ll just spend more time and effort than necessary.

If you use VSCode, getting a Flamegraph of the execution is as simple as running @profview dummy_one_run(). If you want to get allocation line-by-line all you need is a cmd arg ā€œā€“track-allocations=userā€ and then running your program (perhaps clear the counters after precompiling the workload as recommended in the manual).

Btw in the function sweep_matrix! this allocates a vector everytime: K[:,j] .= rand(n) You need to use @views rand!(K[:,j]) (or similar) instead.

I have a related question about profiling (relevant here, so I will add it to the current thread):

Suppose allocations are the reasons why code is slow. Does the profiler accurately attribute the time spent on allocating and freeing up memory to the lines of code that allocate?

From my understanding how profilers work the answer would seem to be ā€œno, GC time is not allocated to the lines that allocate.ā€

If this is correct, then

  1. How can I find out which allocations slow down my code?
  2. Where does the profiler put the blame for the GC time?

Thanks.

1 Like

Perhaps I am misunderstanding the profiler, but I did of course profile before I posted and the profiler results are quite opaque. Inner functions appear to make no allocations, but function seems to spend lots of time memory on these parts of the the outer functions. Here is an example of this from the profiler output with the update_rhs! function. Could this be the GC time not being allocated to the correct lines?

    - function update_rhs!(M, K, B_sl)
    0     @. M = K - B_sl
    0     return nothing
    - end
    - 
    - make_ols_estimates! = function(b_hat, G_minus_j, B_minus_j, B_j, K, j, lhs, rhs, A)
    - 
    0     update_lhs!(lhs, @view(B_j[Not(j),:]), G_minus_j,  B_j, A) # update lhs
89600     update_rhs!(rhs, K, @view(B_minus_j[Not(j),:])) # update rhs
 3200     b_hat .= solve(LinearProblem(rhs, lhs)).u
    - 
    0     return nothing
    - 
    - end
1 Like

For future reference, I think I worked out the issue. The problem is that Not() forces an allocation at each slice even when wrapped in @view(). So each of these calls creates allocations. Hence also why there are no allocations within the update_rhs! function, but there are when I call it within make_ols_estimates!().

If I use x[1:end .!= j,: ] instead of using x[Not(j),:], then most allocations go away.

2 Likes

I think the issue of perfomance and allocations is more subtle than that, and is related to cache misses rather than time spent allocating and freeing up memory.

For example: if you allocate too much memory instead of reusing existing memory, some useful part of the memory wonā€™t be in the L3 cache when you need it, having been flushed to the RAM instead, which is a lot slower. If instead you reuse as much memory as possible, and more useful variables stay within the cache, your program will be substantially faster.

Does this mean the profiler correctly attributes time to lines of code that allocate? Or would I need some other tool to discover the cost of allocations?

Thanks.

Yes, the profiler will give you the time it takes to perform the allocation, but more importantly it should tell you how many MB it is allocating. If you have fat allocations, and you can remove them, there is a potential speed-up as well. Optimizing for memory efficiency is different than optimizing for compute efficiency. Arguably, I think memory efficiency is harder and less well understood.

Are you referring to the ā€œAllocation Profilerā€ (Profile.Allocs.@profile)?

I have not used that one (cannot get PProf.jl to install dependencies on my M1 mac, which makes it hard to visualize results).

So I take it that inferring time spent from bytes allocated is basically trial and error? By that I mean that one would try to remove large allocations (or frequent ones?), rerun the profiler and see what happens to runtime.

Thanks.

Pretty much. Memory efficiency just works like that, regardless of the language.

I guess the really nasty allocations are the frequent ones (inside hot loops). I usually try to allocate a sufficiently large Array outside such functions, and then reuse that array.

Thanks again.

I donā€™t think you need PProf.jl to visualise the results, itā€™s just one possible way to do it. ProfileCanvas.jl produces a flamegraph similar to the one in VSCode.

Thanks - that useful for visualizing allocations.