Help to reduce the number of allocations


#1

Hi guys,

I need to store something like 6000 18x18 matrices. Maybe there is nothing more to do, but maybe someone can help me reduce the number of allocations and improve performance. Here is a simplified version of the code:

using LinearAlgebra

function teste()
    P = Vector{Matrix}(undef, 6000)

    P[1] = zeros(18,18)

    @inbounds for k = 2:6000
        Pp   = P[k-1] + rand(18,18)
        K    = pinv(I + Pp)
        P[k] = (I - K)*Pp*(I - K)' + K*I*K'
    end

    P
end

With this, I get:

julia> @btime teste()
  474.047 ms (239453 allocations: 357.04 MiB)

#2

Preallocate the temporaries needed in the matrix multiplications and use mul!. Also, don’t compute the same thing (I - K) multiple times.

When you use static arrays (StaticArrays.jl), it is often possible to write things in a “mathy” style and still get good performance, but with heap-allocated arrays it typically is not.
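To make that concrete, here is one possible sketch of the preallocation approach. The buffer names and the 0.01 scaling on rand are my additions (the scaling just keeps pinv numerically well-behaved, as discussed further down the thread):

```julia
using LinearAlgebra, Random

function teste_prealloc(n = 6000)
    P = Vector{Matrix{Float64}}(undef, n)  # concrete element type
    P[1] = zeros(18, 18)

    R   = zeros(18, 18)  # reusable buffer for the random draw
    Pp  = zeros(18, 18)
    IK  = zeros(18, 18)
    aux = zeros(18, 18)

    @inbounds for k = 2:n
        rand!(R)
        Pp .= P[k-1] .+ 0.01 .* R      # fused broadcast, no temporary
        K = pinv(I + Pp)               # still allocates (dominant cost)
        IK .= .-K
        IK[diagind(IK)] .+= 1          # IK = I - K, built in place
        mul!(aux, IK, Pp)              # aux = (I - K) * Pp
        P[k] = aux * IK' + K * K'      # (I-K)*Pp*(I-K)' + K*K'
    end
    P
end
```

Most of the remaining allocations still come from pinv and the final products, which matches the profiling results later in the thread.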


#3

Thanks @kristoffer.carlsson, I tried something like this:

using LinearAlgebra

function teste()
    P = Vector{Matrix}(undef, 6000)

    P[1] = zeros(18,18)

    aux1  = zeros(18,18)
    aux2  = zeros(18,18)
    K_I_K = zeros(18,18)

    @inbounds for k = 2:6000
        Pp   = P[k-1] + rand(18,18)
        K    = pinv(I + Pp)
        I_K  = I-K

        mul!(aux1, I_K, Pp)
        mul!(aux2, aux1, I_K')
        mul!(K_I_K, K, K')

        P[k] = aux2 + K_I_K
    end

    P
end

But it was actually slower, with no drop in the number of allocations. Did I miss something?

julia> @btime teste()
  562.355 ms (239453 allocations: 325.82 MiB)

#4

The element type of P is not concrete.


#5

Your P is not concretely typed either (it should be Matrix{Float64} instead of just Matrix). You could put @time on every line to see how much each line allocates, get an overview of where the major allocations are, and make a strategy to work around that. Right now, you are shooting from the hip.

Also, please read through the performance tips section of the manual and show your efforts in applying them (for example, @code_warntype).
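As a minimal illustration of the difference (variable names are mine):

```julia
# Abstract element type: the compiler cannot infer what P[k] returns
P_abstract = Vector{Matrix}(undef, 10)

# Concrete element type: fully inferable
P_concrete = Vector{Matrix{Float64}}(undef, 10)

isconcretetype(eltype(P_abstract))  # false: Matrix is Array{T,2} where T
isconcretetype(eltype(P_concrete))  # true
```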


#6

Ok, thanks guys!


#7

TimerOutputs.jl can help a bit:

using LinearAlgebra
using TimerOutputs

function teste()
    reset_timer!()
    P = Vector{Matrix}(undef, 6000)

    P[1] = zeros(18,18)

    @inbounds for k = 2:6000
        @timeit "compute PP" Pp   = P[k-1] + rand(18,18)
        @timeit "compute K"  K    = pinv(I + Pp)
        @timeit "compute Pk" P[k] = (I - K)*Pp*(I - K)' + K*I*K'
    end

    print_timer()
    P
end
julia> teste();
 ─────────────────────────────────────────────────────────────────────
                              Time                   Allocations
                      ──────────────────────   ───────────────────────
   Tot / % measured:       759ms / 96.2%            357MiB / 100%

 Section      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────
 compute K     6.00k    638ms  87.4%   106ΞΌs    215MiB  60.2%  36.7KiB
 compute Pk    6.00k   79.0ms  10.8%  13.2ΞΌs    110MiB  30.9%  18.9KiB
 compute PP    6.00k   13.2ms  1.80%  2.19ΞΌs   31.5MiB  8.82%  5.38KiB
 ─────────────────────────────────────────────────────────────────────

#8

Excellent suggestion; I did not know about TimerOutputs! Thanks 🙂


#9

Question every array you allocate:

Why are you making a new Pp at each iteration? Why a new K and a new I_K? Why are you not using broadcasting for the additions? Why allocate a new random matrix every time?
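For the last two questions, here is a small sketch of what reusing buffers and broadcasting look like (the buffer names are my own):

```julia
using Random

A  = rand(18, 18)    # stands in for P[k-1]
R  = zeros(18, 18)   # reusable random buffer
Pp = zeros(18, 18)   # reusable result buffer

# Allocating: both rand(18,18) and `+` create new matrices
Pp_alloc = A + rand(18, 18)

# Non-allocating: fill R in place, fuse the addition into Pp
rand!(R)
Pp .= A .+ R
```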


#10

I see. The rand is just something to mimic the original code, which has some computations before it. I will try to follow your advice to reduce the number of allocations. Thanks!


#11

Most of the allocations seem to be in pinv anyway, so preallocating everything else might not actually do much. That’s why doing some light profiling before going too far down the optimization rabbit hole is important.


#12

Yes, I realized that after profiling it with your @timeit suggestion.


#13

By allocating some matrices (instead of using the UniformScaling operator), I managed to reduce the allocations from 3.8 GiB to 2.9 GiB! Thanks guys 🙂


#14

Off-Topic: I get an ERROR: LAPACKException(17) from this code.

julia> using LinearAlgebra

julia> function teste()
           P = Vector{Matrix}(undef, 6000)

           P[1] = zeros(18,18)

           aux1  = zeros(18,18)
           aux2  = zeros(18,18)
           K_I_K = zeros(18,18)

           @inbounds for k = 2:6000
               Pp   = P[k-1] + rand(18,18)
               K    = pinv(I + Pp)
               I_K  = I-K

               mul!(aux1, I_K, Pp)
               mul!(aux2, aux1, I_K')
               mul!(K_I_K, K, K')

               P[k] = aux2 + K_I_K
           end

           P
       end
teste (generic function with 1 method)

julia> teste()
ERROR: LAPACKException(17)
Stacktrace:
 [1] chklapackerror at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\lapack.jl:38 [inlined]
 [2] gesdd!(::Char, ::Array{Float64,2}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\lapack.jl:1602
 [3] #svd!#60(::Bool, ::Function, ::Array{Float64,2}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\svd.jl:63
 [4] #svd! at .\none:0 [inlined]
 [5] #svd#61 at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\svd.jl:106 [inlined]
 [6] #svd at .\none:0 [inlined]
 [7] pinv(::Array{Float64,2}, ::Float64) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\dense.jl:1286
 [8] pinv(::Array{Float64,2}) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\LinearAlgebra\src\dense.jl:1296
 [9] teste() at .\REPL[2]:12
 [10] top-level scope at none:0

What am I doing wrong/why does LAPACK complain?


#15

It can happen because of pinv. The rand in your case was leading to a matrix that pinv cannot invert. Change rand to something like I and it should work.

EDIT: No, it doesn’t; maybe the matrix is becoming too big (Inf).


#16

It worked when I used 0.01rand(18,18), for example, so I guess yes, the values may be getting too large (Inf). Thank you.
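A toy illustration (my own, not the actual recursion) of why the scaling matters: adding a fresh rand(18,18) every step grows each entry by about 0.5 per iteration, so the matrices that pinv sees keep getting larger, while scaling by 0.01 slows that growth by a factor of 100:

```julia
A = zeros(18, 18)
B = zeros(18, 18)
for _ in 1:1000
    A .+= rand(18, 18)          # unscaled: entries grow ~0.5 per step
    B .+= 0.01 .* rand(18, 18)  # scaled: growth 100x slower
end
mean_A = sum(A) / length(A)  # roughly 500
mean_B = sum(B) / length(B)  # roughly 5
```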


#17

There’s also Julia’s built-in performance and memory profiler, if you add this at the end of your script:

using Profile
teste() # run w/o profiling first to ensure compiled & warmed up
@profile teste()
Profile.print(maxdepth=6)

You’ll get something like the below (trimmed a bit):

361 ...usr/share/julia/stdlib/v1.0/Profile/src/Profile.jl:25; top-level scope
  14  allocation_test.jl:9; teste()
    13 ./arraymath.jl:47; +(::Array{Float64,2}, ::Array{Float64,2})
    1  .../usr/share/julia/stdlib/v1.0/Random/src/Random.jl:241; rand
  321 allocation_test.jl:10; teste()
    318 ...are/julia/stdlib/v1.0/LinearAlgebra/src/dense.jl:1296; pinv(::Array{Float64,2})
    3   .../stdlib/v1.0/LinearAlgebra/src/uniformscaling.jl:90; +(::UniformScaling{Bool}, ::Array{Float64,2})
  22  allocation_test.jl:11; teste()
    2 ./arraymath.jl:47; +(::Array{Float64,2}, ::Array{Float64,2})
    6 ./operators.jl:502; *(::Array{Float64,2}, ::Array{Float64,2}, ::Adjoint{Float64,Array{Float64,2}})
    5 ./operators.jl:502; *
    5 ...a/stdlib/v1.0/LinearAlgebra/src/uniformscaling.jl:124; -(::UniformScaling{Bool}, ::Array{Float64,2})

I find this tool very useful. Increase the maxdepth for more detail. There’s also a graphical interface from within Atom if you prefer that.

Then there’s memory allocation tracking. If you start Julia with the flag --track-allocation=user and put this at the end of your script:

using Profile
teste() # run w/o profiling first to ensure compiled & warmed up
Profile.clear_malloc_data()
teste()

Then quit Julia, and locate the newly created .mem file. It will have annotated each line with how much it allocated:

        - function teste()
    50160     P = Vector{Matrix}(undef, 6000)
        -
     2752     P[1] = zeros(18,18)
        -
        0     @inbounds for k = 2:6000
 33018496         Pp   = P[k-1] + rand(18,18)
        0         K    = pinv(I + Pp)
 16597072         P[k] = (I - K)*Pp*(I - K)' + K*I*K'
        -     end
        -
        0     P
        - end

Unfortunately, as you can see above, it’s not very accurate: first, it shows 0 bytes allocated for the pinv line, and second, the total number of bytes allocated is way off (it should be ~357 MB). I’m not sure what’s going on there; I have the impression that this worked well in Julia 0.6 but hasn’t quite worked since. When it works, it is an awesome tool: apart from the obvious use case of seeing what allocates, you can also use it to detect type instabilities.


#18

What does your updated code look like? I’m surprised it wasn’t possible to reduce it a lot more.


#19

I discovered something very strange when looking at your code now.

It occurred to me that K*I*K' should be replaced with K * K' to reduce allocations. And indeed it does, but the performance result seems bizarre to me:

julia> foo(R) = R * R'
foo (generic function with 2 methods)

julia> fooI(R) = R * I * R'
fooI (generic function with 1 method)

julia> R = rand(100, 100);

julia> foo(R) == fooI(R)
true

julia> @btime foo($R);
  59.028 ΞΌs (2 allocations: 78.20 KiB)

julia> @btime fooI($R);
  38.254 ΞΌs (4 allocations: 156.41 KiB)

For small matrices, fooI is consistently faster, but when the matrix size grows larger than roughly 200x200, foo becomes faster.

This is on Julia v1.0.3 with OpenBLAS.


#20

Try @which to find out which method is called. Maybe the one with I calls a Julia implementation that is more efficient than BLAS for small matrices?

julia> @which R*R'
*(A::AbstractArray{T,2} where T, B::AbstractArray{T,2} where T) in LinearAlgebra at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/matmul.jl:140

julia> @which R*I*R'
*(a, b, c, xs...) in Base at operators.jl:502
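As a quick follow-up probe (mine, not from the thread): the three-argument method associates left to right, so R*I*R' first materializes R*I as a copy of R. That likely accounts for the two extra allocations, though it does not by itself explain the small-matrix speedup:

```julia
using LinearAlgebra

R = rand(100, 100)
C = R * I  # multiplying by UniformScaling materializes a copy of R

C == R     # true: same values
C === R    # false: a distinct array was allocated
```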