Help to reduce the number of allocations


#1

Hi guys,

I need to store something like 6000 18x18 matrices. Maybe there is nothing more to do, but maybe someone can help me reduce the number of allocations and improve performance. Here is a simplified version of the code:

using LinearAlgebra

function teste()
    P = Vector{Matrix}(undef, 6000)

    P[1] = zeros(18,18)

    @inbounds for k = 2:6000
        Pp   = P[k-1] + rand(18,18)
        K    = pinv(I + Pp)
        P[k] = (I - K)*Pp*(I - K)' + K*I*K'
    end

    P
end

With this, I get:

julia> @btime teste()
  474.047 ms (239453 allocations: 357.04 MiB)

#2

Preallocate the temporaries needed in the matrix multiplications and use mul!. Also, don’t compute the same thing (I - K) multiple times.

When you use static arrays (StaticArrays.jl), it is often possible to write things in a “mathy” style and still get good performance, but with heap-allocated arrays it typically is not.
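To make that concrete, here is one possible sketch of the preallocation approach. The buffer names and the 0.01 scaling on rand are my additions (the scaling just keeps pinv numerically well-behaved, as discussed further down the thread):

```julia
using LinearAlgebra, Random

function teste_prealloc(n = 6000)
    P = Vector{Matrix{Float64}}(undef, n)  # concrete element type
    P[1] = zeros(18, 18)

    R   = zeros(18, 18)  # reusable buffer for the random draw
    Pp  = zeros(18, 18)
    IK  = zeros(18, 18)
    aux = zeros(18, 18)

    @inbounds for k = 2:n
        rand!(R)
        Pp .= P[k-1] .+ 0.01 .* R      # fused broadcast, no temporary
        K = pinv(I + Pp)               # still allocates (dominant cost)
        IK .= .-K
        IK[diagind(IK)] .+= 1          # IK = I - K, built in place
        mul!(aux, IK, Pp)              # aux = (I - K) * Pp
        P[k] = aux * IK' + K * K'      # (I-K)*Pp*(I-K)' + K*K'
    end
    P
end
```

Most of the remaining allocations still come from pinv and the final products, which matches the profiling results later in the thread.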


#3

Thanks @kristoffer.carlsson, I tried something like this:

using LinearAlgebra

function teste()
    P = Vector{Matrix}(undef, 6000)

    P[1] = zeros(18,18)

    aux1  = zeros(18,18)
    aux2  = zeros(18,18)
    K_I_K = zeros(18,18)

    @inbounds for k = 2:6000
        Pp   = P[k-1] + rand(18,18)
        K    = pinv(I + Pp)
        I_K  = I-K

        mul!(aux1, I_K, Pp)
        mul!(aux2, aux1, I_K')
        mul!(K_I_K, K, K')

        P[k] = aux2 + K_I_K
    end

    P
end

But it was actually slower, with no drop in the number of allocations. Did I miss something?

julia> @btime teste()
  562.355 ms (239453 allocations: 325.82 MiB)

#4

The element type of P is not concrete.


#5

Your P is not concretely typed either (it should be Matrix{Float64} instead of just Matrix). You could put @time on every line to see how much each line allocates, get an overview of where the major allocations are, and make a strategy to work around that. Right now, you are shooting from the hip.

Also, please read through the performance tips section of the manual and show your efforts in applying them (for example, @code_warntype).
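As a minimal illustration of the difference (variable names are mine):

```julia
# Abstract element type: the compiler cannot infer what P[k] returns
P_abstract = Vector{Matrix}(undef, 10)

# Concrete element type: fully inferable
P_concrete = Vector{Matrix{Float64}}(undef, 10)

isconcretetype(eltype(P_abstract))  # false: Matrix is Array{T,2} where T
isconcretetype(eltype(P_concrete))  # true
```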


#6

Ok, thanks guys!


#7

TimerOutputs.jl can help a bit:

using LinearAlgebra
using TimerOutputs

function teste()
    reset_timer!()
    P = Vector{Matrix}(undef, 6000)

    P[1] = zeros(18,18)

    @inbounds for k = 2:6000
        @timeit "compute PP" Pp   = P[k-1] + rand(18,18)
        @timeit "compute K"  K    = pinv(I + Pp)
        @timeit "compute Pk" P[k] = (I - K)*Pp*(I - K)' + K*I*K'
    end

    print_timer()
    P
end
julia> teste();
 ─────────────────────────────────────────────────────────────────────
                              Time                   Allocations
                      ──────────────────────   ───────────────────────
   Tot / % measured:       759ms / 96.2%            357MiB / 100%

 Section      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────
 compute K     6.00k    638ms  87.4%   106ΞΌs    215MiB  60.2%  36.7KiB
 compute Pk    6.00k   79.0ms  10.8%  13.2ΞΌs    110MiB  30.9%  18.9KiB
 compute PP    6.00k   13.2ms  1.80%  2.19ΞΌs   31.5MiB  8.82%  5.38KiB
 ─────────────────────────────────────────────────────────────────────

#8

Excellent suggestion; I did not know about TimerOutputs! Thanks 🙂


#9

Question every array you allocate:

Why are you making a new Pp at each iteration? Why a new K and a new I_K? Why are you not using broadcasting for the additions? Why allocate a new random matrix every time?
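For the last two questions, here is a small sketch of what reusing buffers and broadcasting look like (the buffer names are my own):

```julia
using Random

A  = rand(18, 18)    # stands in for P[k-1]
R  = zeros(18, 18)   # reusable random buffer
Pp = zeros(18, 18)   # reusable result buffer

# Allocating: both rand(18,18) and `+` create new matrices
Pp_alloc = A + rand(18, 18)

# Non-allocating: fill R in place, fuse the addition into Pp
rand!(R)
Pp .= A .+ R
```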


#10

I see. The rand is just something to mimic the original code, which has some computations before it. I will try to follow your advice to reduce the number of allocations. Thanks!


#11

Most of the allocations seem to be in pinv anyway, so preallocating everything else might not actually do much. That’s why doing some light profiling before going too far down the optimization rabbit hole is important.


#12

Yes, I realized that after profiling it with your @timeit suggestion.


#13

By allocating some matrices (instead of using the UniformScaling operator), I managed to reduce the allocations from 3.8 GiB to 2.9 GiB! Thanks guys 🙂


#14

Off-Topic: I get an ERROR: LAPACKException(17) from this code.

julia> using LinearAlgebra

julia> function teste()
           P = Vector{Matrix}(undef, 6000)

           P[1] = zeros(18,18)

           aux1  = zeros(18,18)
           aux2  = zeros(18,18)
           K_I_K = zeros(18,18)

           @inbounds for k = 2:6000
               Pp   = P[k-1] + rand(18,18)
               K    = pinv(I + Pp)
               I_K  = I-K

               mul!(aux1, I_K, Pp)
               mul!(aux2, aux1, I_K')
               mul!(K_I_K, K, K')

               P[k] = aux2 + K_I_K
           end

           P
       end
teste (generic function with 1 method)

julia> teste()
ERROR: LAPACKException(17)
Stacktrace:
 [1] chklapackerror at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\lapack.jl:38 [inlined]
 [2] gesdd!(::Char, ::Array{Float64,2}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\lapack.jl:1602
 [3] #svd!#60(::Bool, ::Function, ::Array{Float64,2}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\svd.jl:63
 [4] #svd! at .\none:0 [inlined]
 [5] #svd#61 at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\svd.jl:106 [inlined]
 [6] #svd at .\none:0 [inlined]
 [7] pinv(::Array{Float64,2}, ::Float64) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\dense.jl:1286
 [8] pinv(::Array{Float64,2}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\dense.jl:1296
 [9] teste() at .\REPL[2]:12
 [10] top-level scope at none:0

What am I doing wrong/why does LAPACK complain?


#15

It can happen because of pinv. The rand in your case was leading to a matrix that pinv cannot invert. Change rand to something like I and it should work.

EDIT: No, it doesn’t; maybe the matrix is becoming too big (Inf).


#16

It worked when I used 0.01rand(18,18), for example, so I guess yes, the values may be getting too large (Inf). Thank you.
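A toy illustration (my own, not the actual recursion) of why the scaling matters: adding a fresh rand(18,18) every step grows each entry by about 0.5 per iteration, so the matrices that pinv sees keep getting larger, while scaling by 0.01 slows that growth by a factor of 100:

```julia
A = zeros(18, 18)
B = zeros(18, 18)
for _ in 1:1000
    A .+= rand(18, 18)          # unscaled: entries grow ~0.5 per step
    B .+= 0.01 .* rand(18, 18)  # scaled: growth 100x slower
end
mean_A = sum(A) / length(A)  # roughly 500
mean_B = sum(B) / length(B)  # roughly 5
```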


#17

There’s also Julia’s built-in performance and memory profiler, if you add this at the end of your script:

using Profile
teste() # run w/o profiling first to ensure compiled & warmed up
@profile teste()
Profile.print(maxdepth=6)

You’ll get something like the below (trimmed a bit):

361 ...usr/share/julia/stdlib/v1.0/Profile/src/Profile.jl:25; top-level scope
  14  allocation_test.jl:9; teste()
    13 ./arraymath.jl:47; +(::Array{Float64,2}, ::Array{Float64,2})
    1  .../usr/share/julia/stdlib/v1.0/Random/src/Random.jl:241; rand
  321 allocation_test.jl:10; teste()
    318 ...are/julia/stdlib/v1.0/LinearAlgebra/src/dense.jl:1296; pinv(::Array{Float64,2})
    3   .../stdlib/v1.0/LinearAlgebra/src/uniformscaling.jl:90; +(::UniformScaling{Bool}, ::Array{Float64,2})
  22  allocation_test.jl:11; teste()
    2 ./arraymath.jl:47; +(::Array{Float64,2}, ::Array{Float64,2})
    6 ./operators.jl:502; *(::Array{Float64,2}, ::Array{Float64,2}, ::Adjoint{Float64,Array{Float64,2}})
    5 ./operators.jl:502; *
    5 ...a/stdlib/v1.0/LinearAlgebra/src/uniformscaling.jl:124; -(::UniformScaling{Bool}, ::Array{Float64,2})

I find this tool very useful. Increase the maxdepth for more detail. There’s also a graphical interface from within Atom if you prefer that.

Then there’s memory allocation tracking. If you start Julia with the flag --track-allocation=user and put this at the end of your script:

using Profile
teste() # run w/o profiling first to ensure compiled & warmed up
Profile.clear_malloc_data()
teste()

Then quit Julia, and locate the newly created .mem file. It will have annotated each line with how much it allocated:

        - function teste()
    50160     P = Vector{Matrix}(undef, 6000)
        -
     2752     P[1] = zeros(18,18)
        -
        0     @inbounds for k = 2:6000
 33018496         Pp   = P[k-1] + rand(18,18)
        0         K    = pinv(I + Pp)
 16597072         P[k] = (I - K)*Pp*(I - K)' + K*I*K'
        -     end
        -
        0     P
        - end

Unfortunately, as you can see above, it’s not very accurate: first, it shows 0 bytes allocated for the pinv line, and second, the total number of bytes allocated is way off (it should be ~357 MB). I’m not sure what’s going on there; I have the impression that this worked well in Julia 0.6 but hasn’t quite worked since. When it works, it is an awesome tool: apart from the obvious use case of seeing what allocates, you can also use it to detect type instabilities.


#18

What does your updated code look like? I’m surprised it wasn’t possible to reduce it a lot more.


#19

I discovered something very strange when looking at your code now.

It occurred to me that K*I*K' should be replaced with K * K' to reduce allocations. And indeed it does, but the performance result seems bizarre to me:

julia> foo(R) = R * R'
foo (generic function with 2 methods)

julia> fooI(R) = R * I * R'
fooI (generic function with 1 method)

julia> R = rand(100, 100);

julia> foo(R) == fooI(R)
true

julia> @btime foo($R);
  59.028 ΞΌs (2 allocations: 78.20 KiB)

julia> @btime fooI($R);
  38.254 ΞΌs (4 allocations: 156.41 KiB)

For small matrices, fooI is consistently faster, but when the matrix size grows larger than roughly 200x200, foo becomes faster.

This is on Julia v1.0.3 with OpenBLAS.


#20

Try @which to find out which method is called. Maybe the one with I calls a Julia implementation that is more efficient than BLAS for small matrices?

julia> @which R*R'
*(A::AbstractArray{T,2} where T, B::AbstractArray{T,2} where T) in LinearAlgebra at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/matmul.jl:140

julia> @which R*I*R'
*(a, b, c, xs...) in Base at operators.jl:502
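As a quick follow-up probe (mine, not from the thread): the three-argument method associates left to right, so R*I*R' first materializes R*I as a copy of R. That likely accounts for the two extra allocations, though it does not by itself explain the small-matrix speedup:

```julia
using LinearAlgebra

R = rand(100, 100)
C = R * I  # multiplying by UniformScaling materializes a copy of R

C == R     # true: same values
C === R    # false: a distinct array was allocated
```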