# Request guidance in optimising the performance of my implementation (running time and allocation)

#1

I suspect that it is possible to further improve my implementation of a particular cost function, but I'm not sure how to proceed. The cost function sums the results of a series of matrix and vector multiplications. I'm hoping someone with more experience can suggest ways to speed up its evaluation. My current implementation allocates a lot of memory, and I suspect this could be reduced.

#### The Implementation

```julia
using StaticArrays
using BenchmarkTools

const ⊗ = kron

mutable struct Point2DH <: FieldVector{3,Float64}
    x::Float64
    y::Float64
    h::Float64
end

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = fill(0.0, (1,1))
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    𝐞₁ = @SVector [1.0, 0.0, 0.0]
    𝐞₂ = @SVector [0.0, 1.0, 0.0]
    for n = 1:N
        𝚲ₙ[1:2,1:2] .= @view Λ₁[n][1:2,1:2]
        𝚲ₙ[3:4,3:4] .= @view Λ₂[n][1:2,1:2]
        𝐦 = ℳ[n]
        𝐦ʹ = ℳʹ[n]
        𝐔ₙ = 𝐦 ⊗ 𝐦ʹ
        ∂ₓ𝐮ₙ = [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ = ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        𝚺ₙ⁻¹ = inv(𝚺ₙ)
        Jₐₘₗ .= Jₐₘₗ + 𝛉' * 𝐔ₙ * 𝚺ₙ⁻¹ * 𝐔ₙ' * 𝛉
    end
    Jₐₘₗ[1]
end

# Some sample data
N = 3376821
ℳ = [Point2DH(rand(3)) for i = 1:N]
ℳʹ = [Point2DH(rand(3)) for i = 1:N]
Λ₁ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
Λ₂ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
t = @MVector rand(9)
𝒞 = (Λ₁, Λ₂)
𝒟 = (ℳ, ℳʹ)

cost(t, 𝒞, 𝒟)
@time cost(t, 𝒞, 𝒟)
@btime cost($t, $𝒞, $𝒟)
```

#### Benchmark Results

```
10.695948 seconds (135.07 M allocations: 13.636 GiB, 15.68% gc time)
10.509 s (135072844 allocations: 13.64 GiB)
```

#### My Julia Version

```
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

```

#2

Looks like StaticArrays is missing `kron` for two `SVector`s. The result is a normal `Array` and then we get really sad.
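To see the issue concretely, here is a minimal sketch (written in current Julia 1.x syntax, so it differs slightly from the 0.6 code above). `skron` is a hypothetical helper, not part of StaticArrays: it shows what a statically-sized Kronecker product of two `SVector`s could look like, returning an `SVector` on the stack instead of falling back to a heap-allocated `Array`. Note that newer StaticArrays releases may have since added `kron` methods for static arrays.

```julia
using StaticArrays
using LinearAlgebra  # `kron` lives here in Julia 1.x

a = SVector(1.0, 2.0, 3.0)
b = SVector(4.0, 5.0, 6.0)

# Hypothetical statically-sized Kronecker product of two SVectors.
# For vectors, kron(a, b)[(i-1)*N + j] == a[i] * b[j].
function skron(a::SVector{M}, b::SVector{N}) where {M,N}
    SVector(ntuple(k -> a[(k - 1) ÷ N + 1] * b[(k - 1) % N + 1], Val(M * N)))
end

skron(a, b)  # an SVector{9,Float64}; no heap allocation
```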

#3

Thank you. To test the potential speed improvement, I've changed the representation from static vectors to static single-column matrices.

#### The Implementation

```julia
using StaticArrays
using BenchmarkTools

const ⊗ = kron

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = fill(0.0, (1,1))
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    # 𝐞₁ = @SVector [1.0, 0.0, 0.0]
    # 𝐞₂ = @SVector [0.0, 1.0, 0.0]
    𝐞₁ = @SMatrix [1.0; 0.0; 0.0]
    𝐞₂ = @SMatrix [0.0; 1.0; 0.0]
    for n = 1:N
        𝚲ₙ[1:2,1:2] .= @view Λ₁[n][1:2,1:2]
        𝚲ₙ[3:4,3:4] .= @view Λ₂[n][1:2,1:2]
        𝐦 = ℳ[n]
        𝐦ʹ = ℳʹ[n]
        𝐔ₙ = 𝐦 ⊗ 𝐦ʹ
        ∂ₓ𝐮ₙ = [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ = ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        𝚺ₙ⁻¹ = inv(𝚺ₙ)
        Jₐₘₗ .= Jₐₘₗ + 𝛉' * 𝐔ₙ * 𝚺ₙ⁻¹ * 𝐔ₙ' * 𝛉
    end
    Jₐₘₗ[1]
end

# Some sample data
N = 3376821
ℳ = [@MMatrix(rand(3,1)) for i = 1:N]
ℳʹ = [@MMatrix(rand(3,1)) for i = 1:N]
Λ₁ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
Λ₂ = [SMatrix{3,3}(diagm([1.0,1.0,0.0])) for i = 1:length(ℳ)]
t = @MVector rand(9)
𝒞 = (Λ₁, Λ₂)
𝒟 = (ℳ, ℳʹ)

cost(t, 𝒞, 𝒟)
@time cost(t, 𝒞, 𝒟)
@btime cost($t, $𝒞, $𝒟)
```

#### Benchmark Results

```
1.133429 seconds (13.51 M allocations: 669.848 MiB, 9.67% gc time)
1.104 s (13507288 allocations: 669.84 MiB)
```

This is almost a 10-fold improvement. Would it be fair to say that, short of parallelism, there is not much more to be done here?

#4

Some more improvements:

```julia
using StaticArrays
using BenchmarkTools

const ⊗ = kron

function cost(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ℳ, ℳʹ = collect(𝒟)
    Λ₁, Λ₂ = collect(𝒞)
    Jₐₘₗ = 0.0
    N = length(𝒟[1])
    𝚲ₙ = @MMatrix zeros(4,4)
    # 𝐞₁ = @SVector [1.0, 0.0, 0.0]
    # 𝐞₂ = @SVector [0.0, 1.0, 0.0]
    𝐞₁ = @SMatrix [1.0; 0.0; 0.0]
    𝐞₂ = @SMatrix [0.0; 1.0; 0.0]
    @inbounds for n = 1:N
        index = SVector(1, 2)
        𝚲ₙ[1:2,1:2] .= Λ₁[n][index, index]
        𝚲ₙ[3:4,3:4] .= Λ₂[n][index, index]
        𝐦 = ℳ[n]
        𝐦ʹ = ℳʹ[n]
        𝐔ₙ = 𝐦 ⊗ 𝐦ʹ
        ∂ₓ𝐮ₙ = [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ = ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        Jₐₘₗ += 𝛉' * 𝐔ₙ * (𝚺ₙ \ 𝐔ₙ') * 𝛉
    end
    Jₐₘₗ
end
```

Before:

```
cost(t, 𝒞, 𝒟) = 1.4694106525431348e6
1.375 s (13507288 allocations: 669.84 MiB)
```

After:

```
cost(t, 𝒞, 𝒟) = 1.4694106525431348e6
972.911 ms (3 allocations: 336 bytes)
```

Changes:

• Use `\` instead of `inv` (pretty big difference, typically numerically better as well)
• Replace `@view` with indexing with `SVector(1, 2)` so that size is known at compile time
• Add `@inbounds` annotation (minor improvement; also note that I neglected adding size checks before the loop, so this is unsafe)
• Change `Jₐₘₗ` from a 1x1 matrix to a `Float64` (big difference)
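The first bullet can be illustrated in isolation. This is a minimal sketch in current Julia 1.x syntax with small hypothetical matrices, not the thread's actual cost function: `\` solves the linear system directly, rather than forming the explicit inverse and then multiplying.

```julia
using StaticArrays
using LinearAlgebra

Σ = @SMatrix [4.0 1.0; 1.0 3.0]   # stand-in for 𝚺ₙ (symmetric positive definite)
U = @SMatrix [1.0 2.0; 3.0 4.0]   # stand-in for 𝐔ₙ

# Explicit inverse: compute inv(Σ), then multiply.
x1 = inv(Σ) * U'

# Backslash: solve Σ * x = U' directly; typically fewer operations
# and better numerical behaviour than materialising the inverse.
x2 = Σ \ U'

maximum(abs.(x1 - x2)) < 1e-12    # the two routes agree up to rounding
```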

#5

It depends on how general you want the code to be. Right now, `𝚲ₙ` is block diagonal and the `𝐞₁ ⊗ 𝐦ʹ` are quite sparse so exploiting those things would likely help.
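A sketch of the block-diagonal idea, using random stand-in blocks rather than the actual Kronecker columns (the 9×2 shapes are assumed from the thread's 9-parameter problem): since `𝚲ₙ = diag(Λ₁ₙ, Λ₂ₙ)` with 2×2 blocks, the congruence `∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'` splits into two smaller products, so the 4×4 middle matrix never needs to be assembled.

```julia
using StaticArrays

Λ1 = @SMatrix rand(2, 2)
Λ2 = @SMatrix rand(2, 2)
A1 = @SMatrix rand(9, 2)   # stand-in for [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ)]
A2 = @SMatrix rand(9, 2)   # stand-in for [(𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
Z  = @SMatrix zeros(2, 2)

# Full route: assemble the 4×4 block-diagonal middle matrix.
A = hcat(A1, A2)
Λ = vcat(hcat(Λ1, Z), hcat(Z, Λ2))
B_full = A * Λ * A'

# Block route: the off-diagonal zero blocks contribute nothing,
# so only two 9×2 congruences remain.
B_block = A1 * Λ1 * A1' + A2 * Λ2 * A2'

maximum(abs.(B_full - B_block)) < 1e-12   # identical up to rounding
```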

#6

#7

Kristoffer and Twan, thank you very much for your excellent suggestions. I am astounded that it was possible to reduce everything down to just 3 allocations. I learned a lot from your posts. Thanks also to everyone else who took the time to read the code snippet and think about the problem.

#8

Hi, thanks for this thread; it's quite a good example! Can I ask a related but off-topic question? What are the last two characters in `Jₐₘₗ`? On my system (macOS) they show up as [] or [?] (a closed box) in both cases. I can certainly store and run the scripts, with similar timing results. Thanks!

#9

These are LATIN SUBSCRIPT SMALL LETTER M (U+2098) and LATIN SUBSCRIPT SMALL LETTER L (U+2097).
A brilliant tool for questions like this is at BabelStone.
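Julia itself can also answer this kind of question; a small sketch (Julia 1.x syntax):

```julia
# Print each character of the identifier together with its Unicode codepoint.
for c in "Jₐₘₗ"
    println(c, " => U+", string(codepoint(c), base = 16, pad = 4))
end
# 'ₘ' is U+2098 (LATIN SUBSCRIPT SMALL LETTER M)
# 'ₗ' is U+2097 (LATIN SUBSCRIPT SMALL LETTER L)
```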

#10

Thank you, indeed a nice tool!