Request for guidance on optimising the performance of my implementation (running time and allocations)


#1

I suspect that it is possible to further improve upon my implementation of a particular cost function, but I’m not sure how to proceed. The cost function sums the results of a series of matrix and vector multiplications. I’m hoping someone with more experience can suggest ways to speed up its evaluation. My current implementation allocates a lot of memory, and I suspect that this could somehow be reduced.

Many thanks for any advice.

The Implementation

using StaticArrays
using BenchmarkTools

const  ⊗ = kron

mutable struct Point2DH <: FieldVector{3,Float64}
    x::Float64
    y::Float64
    h::Float64
end

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = fill(0.0,(1,1))
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    𝐞₁ = @SVector [1.0, 0.0, 0.0]
    𝐞₂ = @SVector [0.0, 1.0, 0.0]
    for n = 1:N
        𝚲ₙ[1:2,1:2] .= @view Λ₁[n][1:2,1:2]
        𝚲ₙ[3:4,3:4] .= @view Λ₂[n][1:2,1:2]
        𝐦 = ℳ[n]
        𝐦ʹ= ℳʹ[n]
        𝐔ₙ = (𝐦 ⊗ 𝐦ʹ)
        ∂ₓ𝐮ₙ =  [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ =  ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        𝚺ₙ⁻¹ = inv(𝚺ₙ)
        Jₐₘₗ .= Jₐₘₗ + 𝛉' * 𝐔ₙ * 𝚺ₙ⁻¹ * 𝐔ₙ' * 𝛉
    end
    Jₐₘₗ[1]
end

# Some sample data
N = 3376821
ℳ = [Point2DH(rand(3)) for i = 1:N]
ℳʹ = [Point2DH(rand(3)) for i = 1:N]
Λ₁ =  [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
Λ₂ =  [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
t = @MVector rand(9)
𝒞 = (Λ₁,Λ₂)
𝒟 = (ℳ, ℳʹ)

cost(t,𝒞 , 𝒟)
@time cost(t,𝒞 , 𝒟)
@btime cost($t,$𝒞 , $𝒟)

Benchmark Results

10.695948 seconds (135.07 M allocations: 13.636 GiB, 15.68% gc time)
10.509 s (135072844 allocations: 13.64 GiB)

My Julia Version

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu


#2

Looks like StaticArrays is missing kron for two SVectors. The result is a normal Array and then we get really sad.
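
To see the fallback concretely, here is a minimal sketch. The `skron` helper is hypothetical (it is not part of StaticArrays); it just shows what a statically-sized Kronecker product of two length-3 vectors could look like:

```julia
using StaticArrays

𝐦  = @SVector rand(3)
𝐦ʹ = @SVector rand(3)

# Base's kron has no method specialised for two SVectors, so it falls
# back to a plain Array and allocates on every call inside the loop:
kron(𝐦, 𝐦ʹ) isa Array

# A hand-rolled static version (hypothetical helper, not in StaticArrays):
skron(a::SVector{3}, b::SVector{3}) = SVector(
    a[1]*b[1], a[1]*b[2], a[1]*b[3],
    a[2]*b[1], a[2]*b[2], a[2]*b[3],
    a[3]*b[1], a[3]*b[2], a[3]*b[3])

# Same values as kron, but the result is a stack-allocated SVector{9}
skron(𝐦, 𝐦ʹ) ≈ kron(𝐦, 𝐦ʹ)
```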


#3

Thank you. To test the potential speed improvement, I’ve changed the representation from static vectors to static single-column matrices.

The Implementation

using StaticArrays
using BenchmarkTools

const  ⊗ = kron

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = fill(0.0,(1,1))
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    # 𝐞₁ = @SVector [1.0, 0.0, 0.0]
    # 𝐞₂ = @SVector [0.0, 1.0, 0.0]
    𝐞₁ = @SMatrix [1.0; 0.0; 0.0]
    𝐞₂ = @SMatrix [0.0; 1.0; 0.0]
    for n = 1:N
        𝚲ₙ[1:2,1:2] .= @view Λ₁[n][1:2,1:2]
        𝚲ₙ[3:4,3:4] .= @view Λ₂[n][1:2,1:2]
        𝐦 = ℳ[n]
        𝐦ʹ= ℳʹ[n]
        𝐔ₙ = (𝐦 ⊗ 𝐦ʹ)
        ∂ₓ𝐮ₙ =  [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ =  ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        𝚺ₙ⁻¹ = inv(𝚺ₙ)
        Jₐₘₗ .= Jₐₘₗ + 𝛉' * 𝐔ₙ * 𝚺ₙ⁻¹ * 𝐔ₙ' * 𝛉
    end
    Jₐₘₗ[1]
end

# Some sample data
N = 3376821
ℳ = [@MMatrix(rand(3,1)) for i = 1:N]
ℳʹ = [@MMatrix(rand(3,1)) for i = 1:N]
Λ₁ =  [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
Λ₂ =  [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
t = @MVector rand(9)
𝒞 = (Λ₁,Λ₂)
𝒟 = (ℳ, ℳʹ)

cost(t,𝒞 , 𝒟)
@time cost(t,𝒞 , 𝒟)
@btime cost($t,$𝒞 , $𝒟)

Benchmark Results

  1.133429 seconds (13.51 M allocations: 669.848 MiB, 9.67% gc time)
  1.104 s (13507288 allocations: 669.84 MiB)

This is almost a 10-fold improvement. Would it be fair to say that, without parallelism, there is not much more to be done here?


#4

Some more improvements:

using StaticArrays
using BenchmarkTools

const  ⊗ = kron

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = 0.0
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    # 𝐞₁ = @SVector [1.0, 0.0, 0.0]
    # 𝐞₂ = @SVector [0.0, 1.0, 0.0]
    𝐞₁ = @SMatrix [1.0; 0.0; 0.0]
    𝐞₂ = @SMatrix [0.0; 1.0; 0.0]
    @inbounds for n = 1:N
        index = SVector(1, 2)
        𝚲ₙ[1:2,1:2] .= Λ₁[n][index, index]
        𝚲ₙ[3:4,3:4] .= Λ₂[n][index, index]
        𝐦 = ℳ[n]
        𝐦ʹ= ℳʹ[n]
        𝐔ₙ = (𝐦 ⊗ 𝐦ʹ)
        ∂ₓ𝐮ₙ =  [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ =  ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        Jₐₘₗ += 𝛉' * 𝐔ₙ * (𝚺ₙ \ 𝐔ₙ') * 𝛉
    end
    Jₐₘₗ
end

Before:

cost(t, 𝒞, 𝒟) = 1.4694106525431348e6
  1.375 s (13507288 allocations: 669.84 MiB)

After:

cost(t, 𝒞, 𝒟) = 1.4694106525431348e6
  972.911 ms (3 allocations: 336 bytes)

Changes:

  • Use \ instead of inv (a pretty big difference, and typically numerically better as well)
  • Replace @view with indexing by SVector(1, 2) so that the size is known at compile time
  • Add an @inbounds annotation (a minor improvement; note also that I neglected to add size checks before the loop, so this is unsafe)
  • Change Jₐₘₗ from a 1×1 matrix to a Float64 (a big difference)
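
The first bullet can be illustrated in isolation. A small sketch on a generic static system (not the thread's 𝚺ₙ):

```julia
using StaticArrays

A = @SMatrix [4.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 2.0]
b = @SVector [1.0, 2.0, 3.0]

x_inv  = inv(A) * b  # forms the explicit inverse first: extra work,
                     # and generally worse numerical conditioning
x_ldiv = A \ b       # solves the linear system directly

x_inv ≈ x_ldiv       # same answer; \ is the better habit
```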

#5

It depends on how general you want the code to be. Right now, 𝚲ₙ is block diagonal and the 𝐞₁ ⊗ 𝐦ʹ terms are quite sparse, so exploiting those structures would likely help.
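
As a sketch of the first idea: since 𝚲ₙ = blockdiag(A, B) with 2×2 blocks, splitting ∂ₓ𝐮ₙ into its first and last two columns D₁ and D₂ lets the zero off-diagonal blocks drop out of 𝐁ₙ entirely. The helper below is hypothetical, assuming static 9×2 column blocks:

```julia
using StaticArrays

# 𝐁ₙ = ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ' with 𝚲ₙ = blockdiag(A, B) reduces to two
# smaller products; the zero off-diagonal blocks of 𝚲ₙ drop out.
function B_blockwise(D₁::SMatrix{9,2,Float64}, D₂::SMatrix{9,2,Float64},
                     A::SMatrix{2,2,Float64}, B::SMatrix{2,2,Float64})
    D₁ * A * D₁' + D₂ * B * D₂'
end

# Agrees with the dense 9×4 * 4×4 * 4×9 product:
D₁ = @SMatrix rand(9, 2); D₂ = @SMatrix rand(9, 2)
A  = @SMatrix rand(2, 2); B  = @SMatrix rand(2, 2)
𝚲  = SMatrix{4,4}([Array(A) zeros(2,2); zeros(2,2) Array(B)])
B_blockwise(D₁, D₂, A, B) ≈ hcat(D₁, D₂) * 𝚲 * hcat(D₁, D₂)'
```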


#6

Also ask yourself whether you care about further improvements.


#7

Kristoffer and Twan, thank you very much for your excellent suggestions. I am astounded that it was possible to reduce everything down to just 3 allocations. I learned a lot from your posts. Thanks also to everyone else who took the time to read the code snippet and think about the problem.


#8

Hi, thanks for this thread; quite a good example! Can I ask a related but off-topic question? What are those last two characters in Jₐₘₗ? On my system (macOS) they show up as [] or [?] (a closed box in both cases). I can certainly store and run the scripts, with similar timing results. Thanks!


#9

These are LATIN SUBSCRIPT SMALL LETTER M and LATIN SUBSCRIPT SMALL LETTER L.
A brilliant tool for questions like this is at BabelStone.


#10

Thank you, indeed a nice tool!