Help to reduce number of allocations

Hi guys,

I need to store something like 6000 18x18 matrices. Maybe there is nothing more to do, but maybe someone can help me to reduce the number of allocations to increase the performance! Here is a simplified version of the code:

```julia
using LinearAlgebra

function teste()
    P = Vector{Matrix}(undef, 6000)

    P[1] = zeros(18,18)

    @inbounds for k = 2:6000
        Pp   = P[k-1] + rand(18,18)
        K    = pinv(I + Pp)
        P[k] = (I - K)*Pp*(I - K)' + K*I*K'
    end

    P
end
```

With this, I get:

```julia
julia> @btime teste()
  474.047 ms (239453 allocations: 357.04 MiB)
```

Preallocate the temporaries needed in the matrix multiplications and use `mul!`; also, don't compute the same thing multiple times (`I - K`).

When you use static arrays (StaticArrays.jl), it is often possible to write things in a "mathy" style and still get good performance, but with heap-allocated arrays it typically is not.
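For illustration, a minimal sketch of that style, assuming StaticArrays.jl is installed (the function name is my own, and it is shown with 3x3 matrices since static arrays pay off mainly for small sizes; 18x18 is near the upper end of where they still help):

```julia
using StaticArrays, LinearAlgebra

# Sketch only: static matrices are stack-allocated, so the "mathy"
# expressions below create no heap temporaries per iteration.
# `inv` is used instead of `pinv` to keep everything static.
function teste_static(n)
    P = Vector{SMatrix{3,3,Float64,9}}(undef, n)
    P[1] = zeros(SMatrix{3,3,Float64})
    for k in 2:n
        Pp   = P[k-1] + rand(SMatrix{3,3,Float64})
        K    = inv(I + Pp)
        IK   = I - K
        P[k] = IK * Pp * IK' + K * K'
    end
    P
end
```

The recurrence and names mirror the posted code; only the container and element types change.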


Thanks @kristoffer.carlsson, I tried something like this:

```julia
using LinearAlgebra

function teste()
    P = Vector{Matrix}(undef, 6000)

    P[1] = zeros(18,18)

    aux1  = zeros(18,18)
    aux2  = zeros(18,18)
    K_I_K = zeros(18,18)

    @inbounds for k = 2:6000
        Pp   = P[k-1] + rand(18,18)
        K    = pinv(I + Pp)
        I_K  = I - K

        mul!(aux1, I_K, Pp)
        mul!(aux2, aux1, I_K')
        mul!(K_I_K, K, K')

        P[k] = aux2 + K_I_K
    end

    P
end
```

But it was actually worse (it needed more time), without any drop in the number of allocations. Did I miss something?

```julia
julia> @btime teste()
  562.355 ms (239453 allocations: 325.82 MiB)
```

The element type of `P` is not concrete.


Your `P` is not concretely typed either (it should be `Matrix{Float64}` instead of just `Matrix`). You could put `@time` on every line to see how much each line allocates, get an overview of where the major allocations are, and make a strategy to work around them. Right now, you are shooting too much from the hip.

Also, please read through the performance tips section of the manual and show your efforts in applying them (`@code_warntype`, for example).
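Putting those two suggestions together, a hedged sketch (the function name and buffer names are my own, not the poster's final code) with a concrete element type, `rand!` into a reused buffer, and `mul!` for the products; `pinv` still allocates internally, so this only removes the temporaries around it:

```julia
using LinearAlgebra, Random

# Sketch: concrete eltype plus reused buffers for the hot loop.
function teste_prealloc(n = 6000)
    P = Vector{Matrix{Float64}}(undef, n)   # concrete element type
    P[1] = zeros(18, 18)

    R   = zeros(18, 18)   # buffer for the random part
    IK  = zeros(18, 18)   # holds I - K
    tmp = zeros(18, 18)

    @inbounds for k in 2:n
        rand!(R)                  # fill in place, no new matrix
        Pp = P[k-1] + R
        K  = pinv(I + Pp)         # pinv allocates internally
        @. IK = -K
        for i in 1:18
            IK[i, i] += 1.0       # IK = I - K without a temporary
        end
        Pk = similar(Pp)
        mul!(tmp, IK, Pp)
        mul!(Pk, tmp, IK')        # Pk = (I-K)*Pp*(I-K)'
        mul!(tmp, K, K')
        Pk .+= tmp                # Pk += K*K'
        P[k] = Pk
    end
    P
end
```

Each iteration still allocates `Pp`, `K`, and `Pk`, since those are stored or produced by `pinv`, but the intermediate products reuse the same buffers.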


Ok, thanks guys!

TimerOutputs.jl can help a bit:

```julia
using LinearAlgebra
using TimerOutputs

function teste()
    reset_timer!()
    P = Vector{Matrix}(undef, 6000)

    P[1] = zeros(18,18)

    @inbounds for k = 2:6000
        @timeit "compute PP" Pp   = P[k-1] + rand(18,18)
        @timeit "compute K"  K    = pinv(I + Pp)
        @timeit "compute Pk" P[k] = (I - K)*Pp*(I - K)' + K*I*K'
    end

    print_timer()
    P
end
```
```julia
julia> teste();
 ─────────────────────────────────────────────────────────────────────
                                Time                   Allocations
                        ──────────────────────   ───────────────────────
    Tot / % measured:        759ms / 96.2%            357MiB / 100%

 Section      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────
 compute K     6.00k    638ms  87.4%   106μs    215MiB  60.2%  36.7KiB
 compute Pk    6.00k   79.0ms  10.8%  13.2μs    110MiB  30.9%  18.9KiB
 compute PP    6.00k   13.2ms  1.80%  2.19μs   31.5MiB  8.82%  5.38KiB
 ─────────────────────────────────────────────────────────────────────
```

Excellent suggestion, I did not know about TimerOutputs! Thanks

Question every array you allocate:

Why are you making a new `Pp` at each iteration, why a new `K` and a new `I_K`? Why are you not using broadcasting for additions? Why allocate a new random matrix every time?

I see; the `rand` is just something to mimic the original code, which has some computations before it. I will try to follow your advice to reduce the number of allocations. Thanks!

Most of the allocations seem to be in `pinv` anyway, so preallocating everything else might not actually do much. That's why doing some light profiling before going too far down the optimization hole is important.


Yes, I realized that after profiling it with your suggested `@timeit`.

By allocating some matrices (instead of using the `UniformScaling` operator), I managed to reduce the allocations from 3.8 GiB to 2.9 GiB! Thanks, guys.
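For reference, one hedged way to do that (variable names are my own): materialize the identity once outside the loop and reuse it, so `I - K` becomes an in-place broadcast instead of allocating a new matrix each iteration.

```julia
using LinearAlgebra

# Sketch: build the identity matrix once instead of using the lazy
# UniformScaling `I` in the hot loop.
Imat = Matrix{Float64}(I, 18, 18)
K    = rand(18, 18)
IK   = similar(K)
IK  .= Imat .- K     # in place: no temporary for I - K
```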

Off-Topic: I get an ERROR: LAPACKException(17) from this code.

```julia
julia> using LinearAlgebra

julia> function teste()
           P = Vector{Matrix}(undef, 6000)

           P[1] = zeros(18,18)

           aux1  = zeros(18,18)
           aux2  = zeros(18,18)
           K_I_K = zeros(18,18)

           @inbounds for k = 2:6000
               Pp   = P[k-1] + rand(18,18)
               K    = pinv(I + Pp)
               I_K  = I - K

               mul!(aux1, I_K, Pp)
               mul!(aux2, aux1, I_K')
               mul!(K_I_K, K, K')

               P[k] = aux2 + K_I_K
           end

           P
       end
teste (generic function with 1 method)

julia> teste()
ERROR: LAPACKException(17)
Stacktrace:
 [3] #svd!#60(::Bool, ::Function, ::Array{Float64,2}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\svd.jl:63
 [4] #svd! at .\none:0 [inlined]
 [6] #svd at .\none:0 [inlined]
 [9] teste() at .\REPL[2]:12
 [10] top-level scope at none:0
```

What am I doing wrong/why does LAPACK complain?

It can happen because of `pinv`. The `rand` in your case was leading to a matrix that `pinv` cannot invert. Change `rand` to something like `I` and it should work.

EDIT: No, it doesn't; maybe the entries are becoming too big (`Inf`).

It worked when I used `0.01rand(18,18)`, for example, so I guess it may indeed produce values that are too large (`Inf`). Thank you.

Thereβs also Juliaβs built-in performance and memory profiler, if you add this at the end of your script:

```julia
using Profile
teste() # run w/o profiling first to ensure compiled & warmed up
@profile teste()
Profile.print(maxdepth=6)
```

Youβll get something like the below (trimmed a bit):

```julia
361 ...usr/share/julia/stdlib/v1.0/Profile/src/Profile.jl:25; top-level scope
 14  allocation_test.jl:9; teste()
  13 ./arraymath.jl:47; +(::Array{Float64,2}, ::Array{Float64,2})
  1  .../usr/share/julia/stdlib/v1.0/Random/src/Random.jl:241; rand
 321 allocation_test.jl:10; teste()
  318 ...are/julia/stdlib/v1.0/LinearAlgebra/src/dense.jl:1296; pinv(::Array{Float64,2})
  3   .../stdlib/v1.0/LinearAlgebra/src/uniformscaling.jl:90; +(::UniformScaling{Bool}, ::Array{Float64,2})
 22  allocation_test:11; teste()
  2 ./arraymath.jl:47; +(::Array{Float64,2}, ::Array{Float64,2})
  5 ./operators.jl:502; *
  5 ...a/stdlib/v1.0/LinearAlgebra/src/uniformscaling.jl:124; -(::UniformScaling{Bool}, ::Array{Float64,2})
```

I find this tool very useful. Increase `maxdepth` for more detail. There's also a graphical interface from within Atom if you prefer that.

Then thereβs memory allocation tracking. If you start Julia with the flag `--track-allocation=user` and put this at the end of your script:

```julia
using Profile
teste() # run w/o profiling first to ensure compiled & warmed up
Profile.clear_malloc_data()
teste()
```

Then quit Julia, and locate the newly created `.mem` file. It will have annotated each line with how much it allocated:

```julia
        - function teste()
    50160     P = Vector{Matrix}(undef, 6000)
        -
     2752     P[1] = zeros(18,18)
        -
        0     @inbounds for k = 2:6000
 33018496         Pp   = P[k-1] + rand(18,18)
        0         K    = pinv(I + Pp)
 16597072         P[k] = (I - K)*Pp*(I - K)' + K*I*K'
        -     end
        -
        0     P
        - end
```

Unfortunately, as you can see above, it's not very accurate: first, it shows 0 bytes allocated for the `pinv` line, and second, the total number of bytes allocated is way off (it should be ~357 MB). Not sure what's going on there; I have the impression that this worked well in Julia 0.6 but hasn't quite worked since. When/if it works, it is an awesome tool. Apart from the obvious use case of seeing what allocates, you can also use it to detect type instabilities.
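As a lighter-weight cross-check when `--track-allocation` misbehaves, `@allocated` on a single expression also works (a quick sketch; run the expression once first so compilation is not counted):

```julia
using LinearAlgebra

A = rand(18, 18)
pinv(A)                        # warm up: exclude compilation cost
bytes = @allocated pinv(A)
println(bytes)                 # nonzero: pinv allocates SVD workspaces
```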


What does your updated code look like? I'm surprised it wasn't possible to reduce it a lot more.

I discovered something very strange when looking at your code now.

It occurred to me that `K*I*K'` should be replaced with `K * K'` to reduce allocations. And indeed it does, but the performance result seems bizarre to me:

```julia
julia> foo(R) = R * R'
foo (generic function with 2 methods)

julia> fooI(R) = R * I * R'
fooI (generic function with 1 method)

julia> R = rand(100, 100);

julia> foo(R) == fooI(R)
true

julia> @btime foo($R);
  59.028 μs (2 allocations: 78.20 KiB)

julia> @btime fooI($R);
  38.254 μs (4 allocations: 156.41 KiB)
```

For small matrices, `fooI` is consistently faster, but when the matrix size grows larger than roughly 200x200 `foo` becomes faster.

This is on Julia v1.0.3 with OpenBLAS.


Try `@which` to find out which method is called. Maybe the one with `I` calls a Julia implementation that is more efficient than BLAS for small matrices?

```julia
julia> @which R*R'
*(A::AbstractArray{T,2} where T, B::AbstractArray{T,2} where T) in LinearAlgebra at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/matmul.jl:140

julia> @which R*I*R'
*(a, b, c, xs...) in Base at operators.jl:502
```
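Part of the allocation difference, at least, is that `R*I*R'` goes through the 3-argument `*` above, and `R*I` on its own already produces a new array rather than returning `R` itself; a quick sketch to confirm:

```julia
using LinearAlgebra

R  = rand(4, 4)
RI = R * I       # multiplying by UniformScaling yields a new array
@show RI == R    # true: same values
@show RI === R   # false: a copy was allocated
```

That accounts for the 4 allocations of `fooI` versus 2 for `foo`; why the extra copy is nonetheless faster for small sizes is the part that remains puzzling.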