# Request guidance in optimising the performance of my implementation (running time and allocation)

#1

I suspect that it is possible to further improve my implementation of a particular cost function, but I'm not sure how to proceed. The cost function sums the results of a series of matrix and vector multiplications. I'm hoping someone with more experience can suggest ways to speed up its evaluation. My current implementation allocates a lot of memory, and I suspect this could be reduced.

#### The Implementation

```julia
using StaticArrays
using BenchmarkTools

const ⊗ = kron

mutable struct Point2DH <: FieldVector{3,Float64}
    x::Float64
    y::Float64
    h::Float64
end

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = fill(0.0, (1,1))
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    𝐞₁ = @SVector [1.0, 0.0, 0.0]
    𝐞₂ = @SVector [0.0, 1.0, 0.0]
    for n = 1:N
        𝚲ₙ[1:2,1:2] .= @view Λ₁[n][1:2,1:2]
        𝚲ₙ[3:4,3:4] .= @view Λ₂[n][1:2,1:2]
        𝐦 = ℳ[n]
        𝐦ʹ = ℳʹ[n]
        𝐔ₙ = 𝐦 ⊗ 𝐦ʹ
        ∂ₓ𝐮ₙ = [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ = ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        𝚺ₙ⁻¹ = inv(𝚺ₙ)
        Jₐₘₗ .= Jₐₘₗ + 𝛉' * 𝐔ₙ * 𝚺ₙ⁻¹ * 𝐔ₙ' * 𝛉
    end
    Jₐₘₗ[1]
end

# Some sample data
N = 3376821
ℳ = [Point2DH(rand(3)) for i = 1:N]
ℳʹ = [Point2DH(rand(3)) for i = 1:N]
Λ₁ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
Λ₂ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
t = @MVector rand(9)
𝒞 = (Λ₁, Λ₂)
𝒟 = (ℳ, ℳʹ)

cost(t, 𝒞, 𝒟)
@time cost(t, 𝒞, 𝒟)
@btime cost($t, $𝒞, $𝒟)
```

#### Benchmark Results

```
10.695948 seconds (135.07 M allocations: 13.636 GiB, 15.68% gc time)
10.509 s (135072844 allocations: 13.64 GiB)
```

#### My Julia Version

```
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

```

#2

Looks like StaticArrays is missing `kron` for two `SVector`s. The result is a normal `Array` and then we get really sad.
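To see the issue concretely, here is a minimal sketch (written in current Julia 1.x syntax, so it differs slightly from the 0.6 code above). `skron` is a hypothetical helper, not part of StaticArrays: it shows what a statically-sized Kronecker product of two `SVector`s could look like, returning an `SVector` on the stack instead of falling back to a heap-allocated `Array`. Note that newer StaticArrays releases may have since added `kron` methods for static arrays.

```julia
using StaticArrays
using LinearAlgebra  # `kron` lives here in Julia 1.x

a = SVector(1.0, 2.0, 3.0)
b = SVector(4.0, 5.0, 6.0)

# Hypothetical statically-sized Kronecker product of two SVectors.
# For vectors, kron(a, b)[(i-1)*N + j] == a[i] * b[j].
function skron(a::SVector{M}, b::SVector{N}) where {M,N}
    SVector(ntuple(k -> a[(k - 1) ÷ N + 1] * b[(k - 1) % N + 1], Val(M * N)))
end

skron(a, b)  # an SVector{9,Float64}; no heap allocation
```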

#3

Thank you. To test the potential speed improvement, I've changed the representation from static vectors to static single-column matrices.

#### The Implementation

```julia
using StaticArrays
using BenchmarkTools

const ⊗ = kron

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = fill(0.0, (1,1))
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    # 𝐞₁ = @SVector [1.0, 0.0, 0.0]
    # 𝐞₂ = @SVector [0.0, 1.0, 0.0]
    𝐞₁ = @SMatrix [1.0; 0.0; 0.0]
    𝐞₂ = @SMatrix [0.0; 1.0; 0.0]
    for n = 1:N
        𝚲ₙ[1:2,1:2] .= @view Λ₁[n][1:2,1:2]
        𝚲ₙ[3:4,3:4] .= @view Λ₂[n][1:2,1:2]
        𝐦 = ℳ[n]
        𝐦ʹ = ℳʹ[n]
        𝐔ₙ = 𝐦 ⊗ 𝐦ʹ
        ∂ₓ𝐮ₙ = [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ = ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        𝚺ₙ⁻¹ = inv(𝚺ₙ)
        Jₐₘₗ .= Jₐₘₗ + 𝛉' * 𝐔ₙ * 𝚺ₙ⁻¹ * 𝐔ₙ' * 𝛉
    end
    Jₐₘₗ[1]
end

# Some sample data
N = 3376821
ℳ = [@MMatrix(rand(3,1)) for i = 1:N]
ℳʹ = [@MMatrix(rand(3,1)) for i = 1:N]
Λ₁ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
Λ₂ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
t = @MVector rand(9)
𝒞 = (Λ₁, Λ₂)
𝒟 = (ℳ, ℳʹ)

cost(t, 𝒞, 𝒟)
@time cost(t, 𝒞, 𝒟)
@btime cost($t, $𝒞, $𝒟)
```

#### Benchmark Results

```
1.133429 seconds (13.51 M allocations: 669.848 MiB, 9.67% gc time)
1.104 s (13507288 allocations: 669.84 MiB)
```

This is almost a 10-fold improvement. Would it be fair to say that, short of parallelism, there is not much more to be done here?

#4

Some more improvements:

```julia
using StaticArrays
using BenchmarkTools

const ⊗ = kron

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = 0.0
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    # 𝐞₁ = @SVector [1.0, 0.0, 0.0]
    # 𝐞₂ = @SVector [0.0, 1.0, 0.0]
    𝐞₁ = @SMatrix [1.0; 0.0; 0.0]
    𝐞₂ = @SMatrix [0.0; 1.0; 0.0]
    @inbounds for n = 1:N
        index = SVector(1, 2)
        𝚲ₙ[1:2,1:2] .= Λ₁[n][index, index]
        𝚲ₙ[3:4,3:4] .= Λ₂[n][index, index]
        𝐦 = ℳ[n]
        𝐦ʹ = ℳʹ[n]
        𝐔ₙ = 𝐦 ⊗ 𝐦ʹ
        ∂ₓ𝐮ₙ = [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ = ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        Jₐₘₗ += 𝛉' * 𝐔ₙ * (𝚺ₙ \ 𝐔ₙ') * 𝛉
    end
    Jₐₘₗ
end
```

Before:

```
cost(t, 𝒞, 𝒟) = 1.4694106525431348e6
1.375 s (13507288 allocations: 669.84 MiB)
```

After:

```
cost(t, 𝒞, 𝒟) = 1.4694106525431348e6
972.911 ms (3 allocations: 336 bytes)
```

Changes:

• Use `\` instead of `inv` (pretty big difference, typically numerically better as well)
• Replace `@view` with indexing with `SVector(1, 2)` so that size is known at compile time
• Add `@inbounds` annotation (minor improvement; also note that I neglected adding size checks before the loop, so this is unsafe)
• Change `Jₐₘₗ` from a 1x1 matrix to a `Float64` (big difference)
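The first bullet can be illustrated in isolation. This is a minimal sketch in current Julia 1.x syntax with small hypothetical matrices, not the thread's actual cost function: `\` solves the linear system directly, rather than forming the explicit inverse and then multiplying.

```julia
using StaticArrays
using LinearAlgebra

Σ = @SMatrix [4.0 1.0; 1.0 3.0]   # stand-in for 𝚺ₙ (symmetric positive definite)
U = @SMatrix [1.0 2.0; 3.0 4.0]   # stand-in for 𝐔ₙ

# Explicit inverse: compute inv(Σ), then multiply.
x1 = inv(Σ) * U'

# Backslash: solve Σ * x = U' directly; typically fewer operations
# and better numerical behaviour than materialising the inverse.
x2 = Σ \ U'

maximum(abs.(x1 - x2)) < 1e-12    # the two routes agree up to rounding
```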

#5

It depends on how general you want the code to be. Right now, `𝚲ₙ` is block diagonal and the `𝐞₁ ⊗ 𝐦ʹ` are quite sparse so exploiting those things would likely help.
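A sketch of the block-diagonal idea, using random stand-in blocks rather than the actual Kronecker columns (the 9×2 shapes are assumed from the thread's 9-parameter problem): since `𝚲ₙ = diag(Λ₁ₙ, Λ₂ₙ)` with 2×2 blocks, the congruence `∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'` splits into two smaller products, so the 4×4 middle matrix never needs to be assembled.

```julia
using StaticArrays

Λ1 = @SMatrix rand(2, 2)
Λ2 = @SMatrix rand(2, 2)
A1 = @SMatrix rand(9, 2)   # stand-in for [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ)]
A2 = @SMatrix rand(9, 2)   # stand-in for [(𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
Z  = @SMatrix zeros(2, 2)

# Full route: assemble the 4×4 block-diagonal middle matrix.
A = hcat(A1, A2)
Λ = vcat(hcat(Λ1, Z), hcat(Z, Λ2))
B_full = A * Λ * A'

# Block route: the off-diagonal zero blocks contribute nothing,
# so only two 9×2 congruences remain.
B_block = A1 * Λ1 * A1' + A2 * Λ2 * A2'

maximum(abs.(B_full - B_block)) < 1e-12   # identical up to rounding
```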

#6

#7

Kristoffer and Twan, thank you very much for your excellent suggestions. I am astounded that it was possible to reduce everything down to just 3 allocations. I learned a lot from your posts. Thanks also to everyone else who took the time to read the code snippet and think about the problem.

#8

Hi, thanks for this thread; it's quite a good example! Can I ask a related but off-topic question? What are the last two characters in `Jₐₘₗ`? On my system (macOS) they show up as [] or [?] (a closed box) in both cases. I can certainly store and run the scripts, with similar timing results. Thanks!

#9

These are LATIN SUBSCRIPT SMALL LETTER M (U+2098) and LATIN SUBSCRIPT SMALL LETTER L (U+2097).
A brilliant tool for questions like this is at BabelStone.
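Julia itself can also answer this kind of question; a small sketch (Julia 1.x syntax):

```julia
# Print each character of the identifier together with its Unicode codepoint.
for c in "Jₐₘₗ"
    println(c, " => U+", string(codepoint(c), base = 16, pad = 4))
end
# 'ₘ' is U+2098 (LATIN SUBSCRIPT SMALL LETTER M)
# 'ₗ' is U+2097 (LATIN SUBSCRIPT SMALL LETTER L)
```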

#10

Thank you, indeed a nice tool!