Help to reduce number of allocations

Your P is not concretely typed either (it should be Matrix{Float64} instead of just Matrix). Put @time on every line to see how much each line allocates, get an overview of where the major allocations are, and make a strategy to work around that. Right now, you are shooting from the hip too much.
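To make the concrete-typing point explicit, a minimal sketch (the container sizes are placeholders):

```julia
# Abstractly typed: the element type Matrix leaves its parameter free,
# so every access to P_abstract[k] incurs dynamic dispatch.
P_abstract = Vector{Matrix}(undef, 10)

# Concretely typed: element accesses are fully inferred.
P_concrete = Vector{Matrix{Float64}}(undef, 10)

@assert !isconcretetype(eltype(P_abstract))
@assert isconcretetype(eltype(P_concrete))
```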

Also, please read through the performance tips section of the manual and show your efforts in applying them; @code_warntype, for example.


Ok, thanks guys!

TimerOutputs.jl can help a bit:

using LinearAlgebra
using TimerOutputs

function teste()
    P = Vector{Matrix}(undef, 6000)

    P[1] = zeros(18,18)

    @inbounds for k = 2:6000
        @timeit "compute PP" Pp   = P[k-1] + rand(18,18)
        @timeit "compute K"  K    = pinv(I + Pp)
        @timeit "compute Pk" P[k] = (I - K)*Pp*(I - K)' + K*I*K'
    end
end

julia> teste();

julia> print_timer()
                              Time                   Allocations
                      ──────────────────────   ───────────────────────
   Tot / % measured:       759ms / 96.2%            357MiB / 100%

 Section      ncalls     time   %tot     avg     alloc   %tot      avg
 compute K     6.00k    638ms  87.4%   106μs    215MiB  60.2%  36.7KiB
 compute Pk    6.00k   79.0ms  10.8%  13.2μs    110MiB  30.9%  18.9KiB
 compute PP    6.00k   13.2ms  1.80%  2.19μs   31.5MiB  8.82%  5.38KiB

Excellent suggestion, I did not know about TimerOutputs! Thanks :slight_smile:

Question every array you allocate:

Why are you making a new Pp at each iteration, why a new K and a new I_K? Why are you not using broadcasting for additions? Why allocate a new random matrix every time?
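To sketch what those questions point at (a hypothetical reworking of the loop, with a smaller iteration count and scaled-down noise so pinv stays well-behaved): fill the random matrix in place with rand!, broadcast the addition into a reused buffer, and note that pinv still allocates internally.

```julia
using LinearAlgebra, Random

function teste_prealloc(k_max = 50)
    P = Vector{Matrix{Float64}}(undef, k_max)
    P[1] = zeros(18, 18)
    Pp = zeros(18, 18)  # reused buffer for P[k-1] + noise
    W  = zeros(18, 18)  # reused buffer for the random matrix
    for k = 2:k_max
        rand!(W)                    # fill in place instead of rand(18,18)
        Pp .= P[k-1] .+ 0.01 .* W   # fused broadcast into the buffer
        K = pinv(I + Pp)            # still allocates internally
        P[k] = (I - K) * Pp * (I - K)' + K * K'
    end
    return P
end

P = teste_prealloc(50)
```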

I see, the rand is just something to mimic the original code, which has some computations before it. I will try to follow your advice to reduce the number of allocations. Thanks!

Most of the allocations seem to be in pinv anyway, so it might not actually do much to preallocate everything else. That’s why doing some light profiling before going too far down the optimization hole is important.


Yes, I realized that after profiling it with your @timeit suggestion.

By allocating some matrices (instead of using the UniformScaling operator), I managed to reduce the allocation from 3.8 GiB to 2.9 GiB! Thanks guys :slight_smile:
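For the record, a minimal sketch of that replacement (the names are hypothetical): instead of forming I - K through the UniformScaling operator on every iteration, subtract from a preallocated identity into a reused buffer.

```julia
using LinearAlgebra

I18 = Matrix{Float64}(I, 18, 18)  # explicit identity, allocated once
K   = rand(18, 18)
I_K = similar(K)                  # reused output buffer

I_K .= I18 .- K                   # broadcast, no temporary
@assert I_K == I - K              # same result as the UniformScaling form
```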

Off-Topic: I get an ERROR: LAPACKException(17) from this code.

julia> using LinearAlgebra

julia> function teste()
           P = Vector{Matrix}(undef, 6000)

           P[1] = zeros(18,18)

           aux1  = zeros(18,18)
           aux2  = zeros(18,18)
           K_I_K = zeros(18,18)

           @inbounds for k = 2:6000
               Pp   = P[k-1] + rand(18,18)
               K    = pinv(I + Pp)
               I_K  = I-K

               mul!(aux1, I_K, Pp)
               mul!(aux2, aux1, I_K')
               mul!(K_I_K, K, K')

               P[k] = aux2 + K_I_K
           end
       end

teste (generic function with 1 method)

julia> teste()
ERROR: LAPACKException(17)
 [1] chklapackerror at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\lapack.jl:38 [inlined]
 [2] gesdd!(::Char, ::Array{Float64,2}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\lapack.jl:1602
 [3] #svd!#60(::Bool, ::Function, ::Array{Float64,2}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\svd.jl:63
 [4] #svd! at .\none:0 [inlined]
 [5] #svd#61 at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\svd.jl:106 [inlined]
 [6] #svd at .\none:0 [inlined]
 [7] pinv(::Array{Float64,2}, ::Float64) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\dense.jl:1286
 [8] pinv(::Array{Float64,2}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\dense.jl:1296
 [9] teste() at .\REPL[2]:12
 [10] top-level scope at none:0

What am I doing wrong/why does LAPACK complain?

It can happen because of pinv. The rand in your case was leading to a matrix that pinv cannot invert. Change rand to something like I and it should work.

EDIT: No it doesn’t, maybe it is becoming too big (Inf).

It worked when I used 0.01rand(18,18), for example, so I guess it may indeed be producing too-large values (Inf). Thank you.

There’s also Julia’s built-in performance and memory profiler, if you add this at the end of your script:

using Profile
teste() # run w/o profiling first to ensure compiled & warmed up
@profile teste()
Profile.print()

You’ll get something like the below (trimmed a bit):

361 ...usr/share/julia/stdlib/v1.0/Profile/src/Profile.jl:25; top-level scope
  14  allocation_test.jl:9; teste()
    13 ./arraymath.jl:47; +(::Array{Float64,2}, ::Array{Float64,2})
    1  .../usr/share/julia/stdlib/v1.0/Random/src/Random.jl:241; rand
  321 allocation_test.jl:10; teste()
    318 ...are/julia/stdlib/v1.0/LinearAlgebra/src/dense.jl:1296; pinv(::Array{Float64,2})
    3   .../stdlib/v1.0/LinearAlgebra/src/uniformscaling.jl:90; +(::UniformScaling{Bool}, ::Array{Float64,2})
  22  allocation_test.jl:11; teste()
    2 ./arraymath.jl:47; +(::Array{Float64,2}, ::Array{Float64,2})
    6 ./operators.jl:502; *(::Array{Float64,2}, ::Array{Float64,2}, ::Adjoint{Float64,Array{Float64,2}})
    5 ./operators.jl:502; *
    5 ...a/stdlib/v1.0/LinearAlgebra/src/uniformscaling.jl:124; -(::UniformScaling{Bool}, ::Array{Float64,2})

I find this tool very useful. Increase the maxdepth keyword of Profile.print for more detail. There’s also a graphical interface from within Atom if you prefer that.

Then there’s memory allocation tracking. If you start Julia with the flag --track-allocation=user and put this at the end of your script:

using Profile
teste() # run once first to ensure compiled & warmed up
Profile.clear_malloc_data() # discard allocations from compilation
teste() # only this run is recorded

Then quit Julia, and locate the newly created .mem file. It will have annotated each line with how much it allocated:

        - function teste()
    50160     P = Vector{Matrix}(undef, 6000)
     2752     P[1] = zeros(18,18)
        0     @inbounds for k = 2:6000
 33018496         Pp   = P[k-1] + rand(18,18)
        0         K    = pinv(I + Pp)
 16597072         P[k] = (I - K)*Pp*(I - K)' + K*I*K'
        -     end
        0     P
        - end

Unfortunately, as you can see above, it’s not very accurate: first, it shows 0 bytes allocated for the pinv line, and second, the total number of bytes allocated is way off (it should be ~357 MB). Not sure what’s going on there; I have the impression that this worked well in Julia 0.6 but hasn’t quite worked since. When it works, it is an awesome tool. Apart from the obvious use case of seeing what allocates, you can also use it to detect type instabilities.


What does your updated code look like? I’m surprised it wasn’t possible to reduce it a lot more.

I discovered something very strange when looking at your code now.

It occurred to me that K*I*K' should be replaced with K * K' to reduce allocations. And indeed it does, but the performance result seems bizarre to me:

julia> using BenchmarkTools

julia> foo(R) = R * R'
foo (generic function with 2 methods)

julia> fooI(R) = R * I * R'
fooI (generic function with 1 method)

julia> R = rand(100, 100);

julia> foo(R) == fooI(R)

julia> @btime foo($R);
  59.028 μs (2 allocations: 78.20 KiB)

julia> @btime fooI($R);
  38.254 μs (4 allocations: 156.41 KiB)

For small matrices, fooI is consistently faster, but when the matrix size grows larger than roughly 200x200 foo becomes faster.

This is on Julia v1.0.3 with OpenBLAS.


Try @which to find out which method is called. Maybe the one with I calls a Julia implementation that is more efficient than BLAS for small matrices?

julia> @which R*R'
*(A::AbstractArray{T,2} where T, B::AbstractArray{T,2} where T) in LinearAlgebra at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/matmul.jl:140

julia> @which R*I*R'
*(a, b, c, xs...) in Base at operators.jl:502
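For what it’s worth, my reading of the three-argument method above (an interpretation, not verified across Julia versions) is that R*I*R' folds left: R*I first materializes a scaled copy of R, then an ordinary product with R' follows, which would account for fooI’s two extra allocations, while R*R' goes straight to the specialized two-argument method:

```julia
using LinearAlgebra

R = rand(100, 100)

step1 = R * I        # materializes a copy of R (one extra allocation)
b = step1 * R'       # then an ordinary matrix product
a = R * R'           # two-argument method, no intermediate copy

@assert step1 == R
@assert a ≈ b
```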

Hi @DNF,

It took me a while to create an MWE, but I figured out the part of the code that was allocating:

using LinearAlgebra

function simulate()

    k_max = 6000
    I18 = Matrix{Float64}(I,18,18)
    O3 = zeros(3,3)

    P0  = copy(I18)
    Q   = 1e-20I18
    R   = diagm(0 => [3e-2, 3e-2, 3e-2, 5e7, 5e7, 5e7])

    Fk_1 = I18

    Hk = [ -I  O3 -I   I  O3  O3;
           -I  O3  O3  O3 -I   I; ]

    Pu  = copy(P0)
    Pp  = similar(P0)
    K   = zeros(18,6)
    aux = zeros(18,18)

    result = Vector{Float64}(undef,0)
    push!(result, tr(Pu))

    @inbounds for k = 2:k_max
        Pp  .= Fk_1*Pu*Fk_1' .+ Q
        K   .= Pp*Hk'*pinv(R .+ Hk*Pp*Hk')
        aux .= I - K*Hk
        Pu  .= aux*Pp*aux' + K*R*K'

        # Compute and store the result.
        push!(result, tr(Pu))
    end

    return result
end


which leads to:

julia> @btime simulate()
  116.695 ms (246202 allocations: 242.91 MiB)

EDIT: At least in my attempts, using mul! did not help to reduce the allocations here either.

P.S.: I cannot pre-allocate the vector result because, in the original code, the loop can be interrupted when a condition that is impossible to simulate here is met.
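For reference, a hypothetical two-buffer pattern for the A*B*A' + C terms above: even with .=, each product on the right-hand side still allocates a temporary, and mul! only avoids that if every intermediate gets its own preallocated destination (although, with pinv inside the loop, its internal allocations still dominate):

```julia
using LinearAlgebra

# out = A*B*A' + C without right-hand-side temporaries.
function abat_plus_c!(out, tmp, A, B, C)
    mul!(tmp, A, B)      # tmp = A*B
    mul!(out, tmp, A')   # out = (A*B)*A'
    out .+= C            # in-place addition
    return out
end

A, B, C = rand(4, 4), rand(4, 4), rand(4, 4)
out, tmp = zeros(4, 4), zeros(4, 4)
abat_plus_c!(out, tmp, A, B, C)
@assert out ≈ A * B * A' + C
```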


After analyzing the code with @timeit I discovered that I had a function that was allocating a lot:

using StaticArrays

skew_matrix(v::AbstractVector{T}) where T<:Real =
    SMatrix{3,3,T,9}(  0,   -v[3], +v[2],
                     +v[3],   0,   -v[1],
                     -v[2], +v[1],   0  )'

function meas_matrix(s::AbstractVector{T}, b::AbstractVector{T}) where T<:Real
    sx = skew_matrix(s)
    bx = skew_matrix(b)
    O3 = zeros(T,3,3)

    [ -sx  O3 -sx   I  O3  O3;
      -bx  O3  O3  O3 -bx   I; ]
end

It turns out that the problem (50% of the allocations) was now due to the O3. Since I need O3 to depend on the type T, I solved this by using a @generated function:

@generated function meas_matrix(s::AbstractVector{T}, b::AbstractVector{T}) where T<:Real
    # These are allocated only once, when the method is compiled,
    # and spliced into the generated body as constants.
    O3 = zeros(T,3,3)
    I3 = Matrix{T}(I,3,3)

    return quote
        sx = skew_matrix(s)
        bx = skew_matrix(b)

        [ -sx  $O3 -sx  $I3  $O3  $O3;
          -bx  $O3  $O3  $O3 -bx  $I3; ]
    end
end

The allocation dropped from 200 MiB to 15 MiB. I could go even further by defining the matrix element by element, which brings it down to 9 MiB, but I do not see any major gains.

Have you checked out FillArrays.jl? It would drop allocations without resorting to @generated
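For example (a sketch assuming FillArrays is installed; the block names mirror the code above): Zeros and Eye are lazy, allocation-free blocks that concatenation can consume directly.

```julia
using LinearAlgebra, FillArrays

function blocks(::Type{T}) where T<:Real
    O3 = Zeros{T}(3, 3)   # lazy zero block, nothing allocated
    I3 = Eye{T}(3)        # lazy identity block
    return [ I3 O3;
             O3 I3 ]
end

M = blocks(Float64)
@assert M == Matrix{Float64}(I, 6, 6)
```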


Awesome! Thanks, I did not know about this package. It seems very nice!