# Request guidance in optimising the performance of my implementation (running time and allocation)

#1

I suspect that it is possible to further improve my implementation of a particular cost function, but I'm not sure how to proceed. The cost function sums the results of a series of matrix and vector multiplications. I'm hoping someone with more experience can suggest ways to speed up its evaluation. My current implementation allocates a lot of memory, and I suspect this could be reduced.

#### The Implementation

```julia
using StaticArrays
using BenchmarkTools

const ⊗ = kron

mutable struct Point2DH <: FieldVector{3,Float64}
    x::Float64
    y::Float64
    h::Float64
end

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = fill(0.0, (1,1))
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    𝐞₁ = @SVector [1.0, 0.0, 0.0]
    𝐞₂ = @SVector [0.0, 1.0, 0.0]
    for n = 1:N
        𝚲ₙ[1:2,1:2] .= @view Λ₁[n][1:2,1:2]
        𝚲ₙ[3:4,3:4] .= @view Λ₂[n][1:2,1:2]
        𝐦 = ℳ[n]
        𝐦ʹ = ℳʹ[n]
        𝐔ₙ = 𝐦 ⊗ 𝐦ʹ
        ∂ₓ𝐮ₙ = [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ = ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        𝚺ₙ⁻¹ = inv(𝚺ₙ)
        Jₐₘₗ .= Jₐₘₗ + 𝛉' * 𝐔ₙ * 𝚺ₙ⁻¹ * 𝐔ₙ' * 𝛉
    end
    Jₐₘₗ[1]
end

# Some sample data
N = 3376821
ℳ = [Point2DH(rand(3)) for i = 1:N]
ℳʹ = [Point2DH(rand(3)) for i = 1:N]
Λ₁ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
Λ₂ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
t = @MVector rand(9)
𝒞 = (Λ₁, Λ₂)
𝒟 = (ℳ, ℳʹ)

cost(t, 𝒞, 𝒟)
@time cost(t, 𝒞, 𝒟)
@btime cost($t, $𝒞, $𝒟)
```

#### Benchmark Results

```
10.695948 seconds (135.07 M allocations: 13.636 GiB, 15.68% gc time)
10.509 s (135072844 allocations: 13.64 GiB)
```

#### My Julia Version

```
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

```

#2

Looks like StaticArrays is missing `kron` for two `SVector`s. The result is a normal `Array` and then we get really sad.
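To see the issue concretely, here is a minimal sketch (written in current Julia 1.x syntax, so it differs slightly from the 0.6 code above). `skron` is a hypothetical helper, not part of StaticArrays: it shows what a statically-sized Kronecker product of two `SVector`s could look like, returning an `SVector` on the stack instead of falling back to a heap-allocated `Array`. Note that newer StaticArrays releases may have since added `kron` methods for static arrays.

```julia
using StaticArrays
using LinearAlgebra  # `kron` lives here in Julia 1.x

a = SVector(1.0, 2.0, 3.0)
b = SVector(4.0, 5.0, 6.0)

# Hypothetical statically-sized Kronecker product of two SVectors.
# For vectors, kron(a, b)[(i-1)*N + j] == a[i] * b[j].
function skron(a::SVector{M}, b::SVector{N}) where {M,N}
    SVector(ntuple(k -> a[(k - 1) ÷ N + 1] * b[(k - 1) % N + 1], Val(M * N)))
end

skron(a, b)  # an SVector{9,Float64}; no heap allocation
```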

#3

Thank you. To test the potential speed improvement, I've changed the representation from static vectors to static single-column matrices.

#### The Implementation

```julia
using StaticArrays
using BenchmarkTools

const ⊗ = kron

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = fill(0.0, (1,1))
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    # 𝐞₁ = @SVector [1.0, 0.0, 0.0]
    # 𝐞₂ = @SVector [0.0, 1.0, 0.0]
    𝐞₁ = @SMatrix [1.0; 0.0; 0.0]
    𝐞₂ = @SMatrix [0.0; 1.0; 0.0]
    for n = 1:N
        𝚲ₙ[1:2,1:2] .= @view Λ₁[n][1:2,1:2]
        𝚲ₙ[3:4,3:4] .= @view Λ₂[n][1:2,1:2]
        𝐦 = ℳ[n]
        𝐦ʹ = ℳʹ[n]
        𝐔ₙ = 𝐦 ⊗ 𝐦ʹ
        ∂ₓ𝐮ₙ = [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ = ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        𝚺ₙ⁻¹ = inv(𝚺ₙ)
        Jₐₘₗ .= Jₐₘₗ + 𝛉' * 𝐔ₙ * 𝚺ₙ⁻¹ * 𝐔ₙ' * 𝛉
    end
    Jₐₘₗ[1]
end

# Some sample data
N = 3376821
ℳ = [@MMatrix(rand(3,1)) for i = 1:N]
ℳʹ = [@MMatrix(rand(3,1)) for i = 1:N]
Λ₁ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
Λ₂ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
t = @MVector rand(9)
𝒞 = (Λ₁, Λ₂)
𝒟 = (ℳ, ℳʹ)

cost(t, 𝒞, 𝒟)
@time cost(t, 𝒞, 𝒟)
@btime cost($t, $𝒞, $𝒟)
```

#### Benchmark Results

```
1.133429 seconds (13.51 M allocations: 669.848 MiB, 9.67% gc time)
1.104 s (13507288 allocations: 669.84 MiB)
```

This is almost a 10-fold improvement. Would it be fair to say that, short of parallelism, there is not much more to be done here?

#4

Some more improvements:

```julia
using StaticArrays
using BenchmarkTools

const ⊗ = kron

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = 0.0
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    # 𝐞₁ = @SVector [1.0, 0.0, 0.0]
    # 𝐞₂ = @SVector [0.0, 1.0, 0.0]
    𝐞₁ = @SMatrix [1.0; 0.0; 0.0]
    𝐞₂ = @SMatrix [0.0; 1.0; 0.0]
    @inbounds for n = 1:N
        index = SVector(1, 2)
        𝚲ₙ[1:2,1:2] .= Λ₁[n][index, index]
        𝚲ₙ[3:4,3:4] .= Λ₂[n][index, index]
        𝐦 = ℳ[n]
        𝐦ʹ = ℳʹ[n]
        𝐔ₙ = 𝐦 ⊗ 𝐦ʹ
        ∂ₓ𝐮ₙ = [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ = ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        Jₐₘₗ += 𝛉' * 𝐔ₙ * (𝚺ₙ \ 𝐔ₙ') * 𝛉
    end
    Jₐₘₗ
end
```

Before:

```
cost(t, 𝒞, 𝒟) = 1.4694106525431348e6
1.375 s (13507288 allocations: 669.84 MiB)
```

After:

```
cost(t, 𝒞, 𝒟) = 1.4694106525431348e6
972.911 ms (3 allocations: 336 bytes)
```

Changes:

• Use `\` instead of `inv` (pretty big difference, typically numerically better as well)
• Replace `@view` with indexing with `SVector(1, 2)` so that size is known at compile time
• Add `@inbounds` annotation (minor improvement; also note that I neglected adding size checks before the loop, so this is unsafe)
• Change `Jₐₘₗ` from a 1x1 matrix to a `Float64` (big difference)
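The first bullet can be illustrated in isolation. This is a minimal sketch in current Julia 1.x syntax with small hypothetical matrices, not the thread's actual cost function: `\` solves the linear system directly, rather than forming the explicit inverse and then multiplying.

```julia
using StaticArrays
using LinearAlgebra

Σ = @SMatrix [4.0 1.0; 1.0 3.0]   # stand-in for 𝚺ₙ (symmetric positive definite)
U = @SMatrix [1.0 2.0; 3.0 4.0]   # stand-in for 𝐔ₙ

# Explicit inverse: compute inv(Σ), then multiply.
x1 = inv(Σ) * U'

# Backslash: solve Σ * x = U' directly; typically fewer operations
# and better numerical behaviour than materialising the inverse.
x2 = Σ \ U'

maximum(abs.(x1 - x2)) < 1e-12    # the two routes agree up to rounding
```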

#5

It depends on how general you want the code to be. Right now, `𝚲ₙ` is block diagonal and the `𝐞₁ ⊗ 𝐦ʹ` are quite sparse so exploiting those things would likely help.
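A sketch of the block-diagonal idea, using random stand-in blocks rather than the actual Kronecker columns (the 9×2 shapes are assumed from the thread's 9-parameter problem): since `𝚲ₙ = diag(Λ₁ₙ, Λ₂ₙ)` with 2×2 blocks, the congruence `∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'` splits into two smaller products, so the 4×4 middle matrix never needs to be assembled.

```julia
using StaticArrays

Λ1 = @SMatrix rand(2, 2)
Λ2 = @SMatrix rand(2, 2)
A1 = @SMatrix rand(9, 2)   # stand-in for [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ)]
A2 = @SMatrix rand(9, 2)   # stand-in for [(𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
Z  = @SMatrix zeros(2, 2)

# Full route: assemble the 4×4 block-diagonal middle matrix.
A = hcat(A1, A2)
Λ = vcat(hcat(Λ1, Z), hcat(Z, Λ2))
B_full = A * Λ * A'

# Block route: the off-diagonal zero blocks contribute nothing,
# so only two 9×2 congruences remain.
B_block = A1 * Λ1 * A1' + A2 * Λ2 * A2'

maximum(abs.(B_full - B_block)) < 1e-12   # identical up to rounding
```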

#6

#7

Kristoffer and Twan, thank you very much for your excellent suggestions. I am astounded that it was possible to reduce everything down to just 3 allocations. I learned a lot from your posts. Thanks also to everyone else who took the time to read the code snippet and think about the problem.

#8

Hi, thanks for this thread; it's quite a good example! Can I ask a related but off-topic question? What are the last two characters in `Jₐₘₗ`? On my system (macOS) they show up as [] or [?] (a closed box) in both cases. I can certainly store and run the scripts, with similar timing results. Thanks!

#9

These are LATIN SUBSCRIPT SMALL LETTER M (U+2098) and LATIN SUBSCRIPT SMALL LETTER L (U+2097).
A brilliant tool for questions like this is at BabelStone.
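Julia itself can also answer this kind of question; a small sketch (Julia 1.x syntax):

```julia
# Print each character of the identifier together with its Unicode codepoint.
for c in "Jₐₘₗ"
    println(c, " => U+", string(codepoint(c), base = 16, pad = 4))
end
# 'ₘ' is U+2098 (LATIN SUBSCRIPT SMALL LETTER M)
# 'ₗ' is U+2097 (LATIN SUBSCRIPT SMALL LETTER L)
```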

#10

Thank you, indeed a nice tool!