# Prevent huge number of allocations mutating columns of arrays

Hi, I am writing some code to update the columns of an array by repeatedly solving a linear system. To improve performance, I try to allocate arrays for the solution, the left-hand side, and the right-hand side at the beginning of the code and then mutate them in place. Below is a minimal working example (in the real use case `n` will be much larger).

``````
using LinearAlgebra, Distributions, Random, LinearSolve, BenchmarkTools, InvertedIndices

mutable struct G_hat{M<:Matrix{Float64}} # mutable struct holding the guess matrix we will update
    mat::M
end

function G_hat(N::Int64) # outer constructor for our guess
    G_hat(ones(N, N))
end

struct B_struct{A<:Matrix{Float64}} # immutable struct object for B
    mat::A
end

function make_B(N::Int64) # outer constructor for B_struct
    B_struct(rand(N, N))
end

function init_lhs(n::Int64) # preallocate the left-hand side
    Vector{Float64}(undef, n - 1)
end

f(a, b) = a .- b

function update_lhs!(M, B_j_sl, G_minus_j, B_j)
    A = similar(B_j_sl)
    mul!(A, G_minus_j', B_j)
    M .= B_j_sl - A
    return nothing
end

function init_rhs(n::Int64) # preallocate the right-hand side
    Matrix{Float64}(undef, (n - 1, n - 1))
end

function update_rhs!(M, K, B_sl)
    M .= f.(K, B_sl)
    return nothing
end

make_ols_estimates! = function(b_hat, G_minus_j, B_minus_j, B_j, K, j, lhs, rhs)
    update_lhs!(lhs, @view(B_j[Not(j), :]), G_minus_j, B_j)  # update lhs - lots of allocations for some reason
    update_rhs!(rhs, K, @view(B_minus_j[Not(j), :]))         # update rhs - still lots of allocations!
    b_hat .= solve(LinearProblem(rhs, lhs)).u
    return nothing
end

sweep_column_two! = function(G_hat_1, B, K, j, p, b_hat, lhs, rhs)
    make_ols_estimates!(b_hat, @view(G_hat_1[:, Not(j)]), @view(B[:, Not(j)]), @view(B[:, j]), K, j, lhs, rhs)
    pos = quantile(vec(b_hat), p)
    replace!(x -> x > pos ? zero(Float64) : x, b_hat)
    replace!(x -> x ≤ pos ? 1.0 : x, b_hat)
    G_hat_1[Not(j), j] .= b_hat
    return nothing
end

sweep_matrix! = function(G_hat_1, B, K, p, n, lhs, rhs)
    for j in 1:n
        b_hat = Vector{Float64}(undef, n - 1) # allocate vector for results
        sweep_column_two!(G_hat_1, B, @view(K[Not(j), Not(j)]), j, p, b_hat, lhs, rhs) # update column
        K[:, j] .= rand(n) # update K
    end
    return nothing
end

dummy_one_run = function()
    n = 100
    p = 0.3
    gamma = 0.9
    maxit = 1
    B = make_B(n).mat

    G_hat_1 = G_hat(n).mat
    K = rand(n, n)
    lhs = init_lhs(n)
    rhs = init_rhs(n)

    for it in 1:maxit
        println(it)
        sweep_matrix!(G_hat_1, B, K, p, n, lhs, rhs)
    end
    return nothing
end

@btime dummy_one_run()
``````

For some reason, I end up creating a huge number of allocations (392480) somewhere in the inner loops when trying to mutate. The allocations make the code run very slowly. Yet when I `@btime` the individual lines within the `update_lhs!` and `update_rhs!` functions, they seem to perform very few allocations. I would really appreciate some help, as I cannot work out where the allocations are happening…

It’s not exactly a minimal working example…

`update_lhs!` allocates `A` each time. This is where your allocations may come from.
Try to preallocate `A` and pass it into `update_lhs!`.

The next step is (always) to run a profiler. There is really no point trying to optimize code until you know where it actually spends most of its time.
Once you have those results, you can construct a more minimal example that will be easier to work on.

Finally, why the strange notation `sweep_column_two! = function ...` instead of `function sweep_column_two!(args) ...`?


Thanks for your comment! Maybe it is easier to focus on the `update_lhs!` function. After implementing your suggestion, I get this function:

``````
function update_lhs!(M::Vector{Float64}, B_j_sl::Vector{Float64}, G_minus_j::Matrix{Float64}, B_j::Vector{Float64}, A::Vector{Float64})
    mul!(A, G_minus_j', B_j)
    M .= B_j_sl - A
    return nothing
end
``````

This function still makes seven allocations each time it is applied to mutate the vector. The same holds for the `update_rhs!` function, which makes eight allocations each time. I do not understand why, as my understanding was that `.=` should alter the object in place and give me zero allocations. I wondered whether it was due to an interaction or type-stability problem with some object further up (hence the less minimal ‘minimal’ example)…

(Thanks for spotting the strange notation - I have had to work in other languages recently due to non-Julia co-authors which I think is to blame here)

You need a dot for every operation, otherwise `B_j_sl - A` just makes another temporary.

``````M .= B_j_sl .- A
# or
@. M = B_j_sl - A # @. applies . everywhere
``````

I’d really recommend profiling your code before trying to do any optimization, otherwise you’ll just spend more time and effort than necessary.

If you use VSCode, getting a flamegraph of the execution is as simple as running `@profview dummy_one_run()`. If you want allocations line by line, all you need is the command-line argument `--track-allocations=user` and then running your program (perhaps clear the counters after precompiling the workload, as recommended in the manual).
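For instance, such a run could look like this (a sketch; `myscript.jl` and `dummy_one_run` are stand-ins for your actual file and workload):

```shell
# Start Julia with per-line allocation tracking for user code
julia --track-allocations=user -e '
    include("myscript.jl")       # hypothetical file defining dummy_one_run
    dummy_one_run()              # warm-up run, triggers compilation
    using Profile
    Profile.clear_malloc_data()  # discard counters accumulated during compilation
    dummy_one_run()              # only this run ends up in the counters
'
# After exit, a *.mem file with per-line byte counts sits next to each source file
```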

Btw, in the function `sweep_matrix!` this line allocates a vector every time: `K[:,j] .= rand(n)`. You need to use `@views rand!(K[:,j])` (or similar) instead.
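For example (a small sketch of the in-place pattern):

```julia
using Random

K = zeros(4, 4)
rand!(@view K[:, 2])   # overwrite column 2 in place, no temporary vector
# K[:, 2] .= rand(4)   # by contrast, this first allocates a fresh length-4 vector
```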

I have a related question about profiling (relevant here, so I will add it to the current thread):

Suppose allocations are the reasons why code is slow. Does the profiler accurately attribute the time spent on allocating and freeing up memory to the lines of code that allocate?

From my understanding of how profilers work, the answer would seem to be “no, GC time is not attributed to the lines that allocate.”

If this is correct, then

1. How can I find out which allocations slow down my code?
2. Where does the profiler put the blame for the GC time?

Thanks.


Perhaps I am misunderstanding the profiler, but I did of course profile before I posted, and the profiler results are quite opaque. Inner functions appear to make no allocations, yet the code seems to spend a lot of time and memory on these parts of the outer functions. Here is an example of this from the profiler output with the `update_rhs!` function. Could this be the GC time not being attributed to the correct lines?

``````    - function update_rhs!(M, K, B_sl)
0     @. M = K - B_sl
0     return nothing
- end
-
- make_ols_estimates! = function(b_hat, G_minus_j, B_minus_j, B_j, K, j, lhs, rhs, A)
-
0     update_lhs!(lhs, @view(B_j[Not(j),:]), G_minus_j,  B_j, A) # update lhs
89600     update_rhs!(rhs, K, @view(B_minus_j[Not(j),:])) # update rhs
3200     b_hat .= solve(LinearProblem(rhs, lhs)).u
-
0     return nothing
-
- end
``````

For future reference, I think I worked out the issue. The problem is that `Not()` forces an allocation at each slice, even when wrapped in `@view()`, so each of these calls creates allocations. Hence also why there are no allocations within the `update_rhs!` function itself, but there are when I call it within `make_ols_estimates!()`.

If I use `x[1:end .!= j, :]` instead of `x[Not(j), :]`, then most of the allocations go away.
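To illustrate the two spellings side by side (a sketch; the boolean mask still allocates a small `BitVector` per slice, but the per-call cost observed with `Not` disappears):

```julia
using InvertedIndices

x = rand(5, 5)
j = 3
a = @view x[Not(j), :]        # Not(j): indices are materialised on every call
b = @view x[1:end .!= j, :]   # boolean mask: only a small BitVector is built
a == b                        # both views select the same rows
```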


I think the issue of performance and allocations is more subtle than that, and is related to cache misses rather than to the time spent allocating and freeing memory.

For example: if you allocate too much memory instead of reusing existing memory, some useful part of the memory won’t be in the L3 cache when you need it, having been flushed to the RAM instead, which is a lot slower. If instead you reuse as much memory as possible, and more useful variables stay within the cache, your program will be substantially faster.

Does this mean the profiler correctly attributes time to lines of code that allocate? Or would I need some other tool to discover the cost of allocations?

Thanks.

Yes, the profiler will give you the time it takes to perform an allocation, but more importantly it should tell you how many MB are being allocated. If you have fat allocations and you can remove them, there is a potential speed-up as well. Optimizing for memory efficiency is different from optimizing for compute efficiency. Arguably, memory efficiency is harder and less well understood.

Are you referring to the “Allocation Profiler” (`Profile.Allocs.@profile`)?

I have not used that one (I cannot get `PProf.jl` to install its dependencies on my M1 Mac, which makes it hard to visualize results).

So I take it that inferring time spent from bytes allocated is basically trial and error? By that I mean one would try to remove large allocations (or frequent ones?), rerun the profiler, and see what happens to the runtime.

Thanks.

Pretty much. Memory efficiency just works like that, regardless of the language.

I guess the really nasty allocations are the frequent ones (inside hot loops). I usually try to allocate a sufficiently large array outside such functions, and then reuse that array.
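The pattern I mean is roughly this (a sketch with made-up names):

```julia
# The buffer is allocated once, outside the hot loop, and reused every iteration
function accumulate_columns!(out, buf, A)
    for j in axes(A, 2)
        buf .= @view A[:, j]   # copy into the preallocated buffer
        out .+= buf            # no per-iteration allocation
    end
    return out
end

A = rand(3, 4)
buf = zeros(3)                 # allocated once
out = zeros(3)
accumulate_columns!(out, buf, A)
```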

Thanks again.

I don’t think you need PProf.jl to visualise the results, it’s just one possible way to do it. ProfileCanvas.jl produces a flamegraph similar to the one in VSCode.
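A minimal session might look like this (a sketch; assumes ProfileCanvas.jl is installed and that `@profview_allocs` is available in your version, and `workload` is a hypothetical stand-in for your own code):

```julia
using ProfileCanvas

workload() = sum(sum(rand(100)) for _ in 1:10_000)  # hypothetical allocating workload

workload()                                  # warm up so compilation is excluded
ProfileCanvas.@profview_allocs workload()   # opens a flamegraph of allocation sites
```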

Thanks - that is useful for visualizing allocations.