Optimising code: Multiplying a list of matrices by a matrix

I need help optimising a piece of code that involves computing many small vectors and matrices. I have a list of vectors and matrices, given by mu_i and S_i respectively, and I need to calculate nu_i = A*mu_i + b and P_i = A*S_i*A' + R, where A, b (called u in the code below) and R all have conforming sizes. This operation is typical in multi-target tracking, where a target's spatial statistics are pushed through a motion (or measurement) model. The operation is performed at every time-step of the simulation, so it helps if it is fast.

In the code below I consider two different options: a for loop and a “vectorised” approach I previously used in Matlab. The “vectorised” approach is faster for large lists of vectors and matrices, but I’m not sure I’m using Julia effectively. Is there anything I can do to speed up this code? The operation can be computed in parallel, but I’ve not had much success with Julia’s distributed for loop.

using LinearAlgebra
# De-vectorised code
function devectorised!(mu, S, A, ATranspose, R, simulationLength)
    n = size(mu, 2)
    for i = 1:simulationLength
        for j = 1:n
            mu[:, j] = A*mu[:, j] + u                   # nu_j = A*mu_j + u
            S[:, :, j] = A*S[:, :, j]*ATranspose + R    # P_j = A*S_j*A' + R
        end
    end
end
# Vectorised code
function vectorised(nu, P, A, ATranspose, R, simulationLength)
    (d, n) = size(nu)
    mu = copy(nu)
    S = copy(P)
    for i = 1:simulationLength
        mu = A*mu .+ u
        # Compute A*S_j*A' for all j at once by flattening the third dimension:
        D = reshape(permutedims(S, (1, 3, 2)), (d*n, d))  # stack the n slices vertically
        E = reshape(D*ATranspose, (d, n, d))              # right-multiply every slice by A'
        F = reshape(permutedims(E, (1, 3, 2)), (d, d*n))  # re-stack for the left multiply
        S = reshape(A*F, (d, d, n)) .+ R                  # left-multiply by A and add R
    end
    return mu, S
end
# Admin
d = 6
n = 1000
simulationLength = 3000
# Motion model
T = 20e-3
A = [Array(1.0I, 3, 3) T*Array(1.0I, 3, 3); zeros(3, 3) Array(1.0I, 3, 3)]
ATranspose = A'
u = zeros(d, 1)
u[3, 1] = -0.5*(9.81)*(T^2)
u[6, 1] = -(9.81)*(T)
# Process noise
R = 4*Array(1.0I, d, d)
# Time the functions
mu = rand(d, n)
S = rand(d, d, n)
nu = copy(mu)
P = copy(S)
@time devectorised!(mu, S, A, ATranspose, R, simulationLength)
@time (nuUpdated, PUpdated) = vectorised(nu, P, A, ATranspose, R, simulationLength)
# Check answers
maximumMeanError = maximum(abs.(mu[:] - nuUpdated[:]))
maximumCovarianceError = maximum(abs.(S[:] - PUpdated[:]))
println("The maximum mean error is ", maximumMeanError)
println("The maximum covariance error is ", maximumCovarianceError)

Furthermore, is Julia appropriate for real-time operation? I’ve previously used heavily vectorised Matlab code in real-time operation and it worked well enough, so I would hope Julia can perform at least as well.

Your second path of keeping things as large arrays seems like a good one to me. I haven’t completely decoded it, but one thing you can do for speed is to re-use pre-allocated buffers via the in-place operations, permutedims!(_S, S, ...) and mul! etc.
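
For example, a minimal sketch of the prediction step with all the buffers pre-allocated (the *Buf names here are made up, and I haven’t benchmarked this):

using LinearAlgebra
# Same computation as vectorised, but every intermediate writes into
# pre-allocated storage; the reshapes share memory and are created once.
function vectorised_inplace!(mu, S, A, ATranspose, R, u, simulationLength)
    (d, n) = size(mu)
    muBuf = similar(mu)
    DBuf = Array{Float64}(undef, d, n, d);  DMat  = reshape(DBuf, (d*n, d))
    EBuf = Array{Float64}(undef, d*n, d);   ECube = reshape(EBuf, (d, n, d))
    FBuf = Array{Float64}(undef, d, d, n);  FMat  = reshape(FBuf, (d, d*n))
    SBuf = similar(S);                      SMat  = reshape(SBuf, (d, d*n))
    for i = 1:simulationLength
        mul!(muBuf, A, mu)                   # muBuf = A*mu
        mu .= muBuf .+ u                     # mu = A*mu + u, no new allocation
        permutedims!(DBuf, S, (1, 3, 2))     # in-place version of permutedims
        mul!(EBuf, DMat, ATranspose)         # right-multiply all slices by A'
        permutedims!(FBuf, ECube, (1, 3, 2))
        mul!(SMat, A, FMat)                  # left-multiply by A
        S .= SBuf .+ R                       # add R to every slice
    end
    return mu, S
end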

You can probably also write the whole thing in two lines using TensorOperations.jl, which will handle all of this reshaping and permuting for you, and is pretty smart about caching intermediate storage.
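
Something along these lines, perhaps (untested), with the addition of R as a separate broadcast:

using TensorOperations
# Sketch: one @tensor line does all of the reshaping and permuting internally.
@tensor P[a, e, j] := A[a, b] * S[b, c, j] * A[e, c]   # P_j = A*S_j*A'
P .+= R                                                # broadcast R over all n slices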

Also, matrix * is multi-threaded by default. It looks like your loop over i has to happen in sequence.
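
The loop over j, however, is embarrassingly parallel, so something like this should work (a sketch, assuming Julia was started with JULIA_NUM_THREADS > 1, and with u passed in as an argument):

using Base.Threads
function devectorised_threaded!(mu, S, A, ATranspose, R, u, simulationLength)
    n = size(mu, 2)
    for i = 1:simulationLength   # time steps must stay sequential
        @threads for j = 1:n     # but each target is independent
            mu[:, j] = A*mu[:, j] + u
            S[:, :, j] = A*S[:, :, j]*ATranspose + R
        end
    end
end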

It always depends on how strict you are when you talk about real time. We have successfully controlled ABB robots at 250 Hz with a 1.5 ms maximum return time, and we have controlled several student lab processes at <100 Hz with no issues. You need to keep allocations under control though, as otherwise the GC might give you headaches in really fast applications.

Edit: The GC can of course be turned off, if the program execution does not last long and the GC invocations are troublesome. Just keep in mind that memory consumption will grow without bound while it is off.
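
For example (a sketch; process_measurements! is a hypothetical time-critical function):

GC.enable(false)                 # stop the collector before the critical section
process_measurements!(tracker)   # hypothetical: should allocate as little as possible
GC.enable(true)                  # re-enable it afterwards
GC.gc()                          # and optionally collect immediately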

By the way, heavily vectorised Matlab is extremely wasteful with allocations. Did you run the Matlab code in the Matlab runtime, or did you use it to compile C code or MEX functions? If you ran the Matlab natively, then Julia should allow you to do much better.

I ran it using the Matlab runtime. I’m very new to Julia – I’ve been using it less than a week – and I’m far from a Matlab expert. To add some context, I wrote a multi-sensor multi-target tracking algorithm in Matlab and vectorised the code; it achieved real-time results for a moderate number of sensors and targets. However, the code doesn’t scale well as the number of sensors increases, even though that part of the code (the measurement update) is inherently parallel in my algorithm. Matlab’s parallel computing toolbox didn’t help, as expected, so I gave Julia a try. An initial devectorised approach was incredibly slow, so I just ported my Matlab code into Julia, which improved the situation. However, I doubt this approach plays to Julia’s strengths. I haven’t quite gotten round to multi-threading yet.

Ok, yeah, vectorising the code is not always the best strategy in Julia, although the broadcasting machinery will get you further (faster) than what you can do in Matlab. Writing the loops manually in Julia gives you the potential to achieve Fortran/C speed, but you must of course follow the performance tips to see any benefit over Matlab.

If I read your code correctly, your d = 6 is small enough to potentially see benefits from StaticArrays. TensorOperations, as mentioned above, is also worth pursuing, as your tight loop is currently allocating tons of memory (every slice x[:, :, i] allocates a new matrix).
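
For instance, views get rid of the slice copies on the right-hand side (a sketch; the matrix products themselves still allocate their results):

# @views turns every right-hand-side slice into a non-copying view.
@views for j = 1:n
    mu[:, j] = A*mu[:, j] + u
    S[:, :, j] = A*S[:, :, j]*ATranspose + R
end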

I changed your code to use static arrays; the timing of the devectorised version is printed at the bottom.

using LinearAlgebra, BenchmarkTools, StaticArrays
# De-vectorised code
function devectorised!(mu, S, A, ATranspose, R, simulationLength, u)
    n = length(mu)
    for i = 1:simulationLength
        for j = 1:n
            mu[j] = A*mu[j] + u
            S[j] = A*S[j]*ATranspose + R
        end
    end
end

# Admin
d = 6
n = 1000
simulationLength = 30
# Motion model
T = 20e-3
A = [Array(1.0I, 3, 3) T*Array(1.0I, 3, 3); zeros(3, 3) Array(1.0I, 3, 3)]
A = SMatrix{size(A)...}(A)
ATranspose = A'
u = zeros(d)
u[3] = -0.5*(9.81)*(T^2)
u[6] = -(9.81)*(T)
u = SVector{length(u)}(u)
# Process noise
R = 4*I
# Time the functions
mu = [@SVector rand(d) for _ in 1:n]
S = [@SMatrix rand(d, d) for _ in 1:n]
nu = copy(mu)
P = copy(S)
@btime devectorised!(mu, S, A, ATranspose, R, simulationLength, u)
@btime (nuUpdated, PUpdated) = vectorised(nu, P, A, ATranspose, R, simulationLength)  # note: vectorised expects the plain-array nu and P from the first post
# Check answers
maximumMeanError = maximum(abs.(mu[:] - nuUpdated[:]))
maximumCovarianceError = maximum(abs.(S[:] - PUpdated[:]))
println("The maximum mean error is ", maximumMeanError)
println("The maximum covariance error is ", maximumCovarianceError)

# Before 42.852 ms (254670 allocations: 53.78 MiB)
# After 2.023 ms (0 allocations: 0 bytes)
# Vectorized  13.915 ms (965 allocations: 44.31 MiB)

One big problem with your code was that the variable u was a global, so I added it as an argument. I changed all the arrays to static versions, and I changed R to I::UniformScaling.

This is a huge improvement, thank you for the help.

A while ago I asked a similar question, which you might find a useful read:

By the way, in recent versions of Julia, this can be written

Matrix([I(3) T*I(3); 0I I])
# alternatively
diagm(ones(6)) + diagm(3 => T*ones(3))
I(6) + diagm(3 => T*ones(3))

Apologies for entering this thread with another topic, but I have a problem which might be close to this one. I’d like to kindly ask if anyone has experienced it or could look into it: the performance of loops with larger matrices and matrix multiplication. See: “For loop in function and multiplication of larger matrices, slow speed in parallel”.
Thanks a lot.

A belated reply: I gave TensorOperations a try, but I couldn’t quite get the addition right using tensor operations. Do you have any idea how to perform the addition?

using LinearAlgebra, TensorOperations
# Dimensions
d = 6
n = 20
# Matrices
mu = rand(d)
A = [Array{Float64}(I, 3, 3) 0.2*Array{Float64}(I, 3, 3); zeros(3, 3) Array{Float64}(I, 3, 3)]
ATranspose = Array(A')
R = Array{Float64}(I, d, d)
S = rand(d, d, n)
# TensorOperations
@tensor begin
    PTensor[a, e, d] := A[a, b]*S[b, c, d]*ATranspose[c, e]
end
PTensor = PTensor .+ R
# Vectorisation method
D = reshape(permutedims(S, (1, 3, 2)), (d*n, d))
E = reshape(D*ATranspose, (d, n, d))
F = reshape(permutedims(E, (1, 3, 2)), (d, d*n))
PVectorised = reshape(A*F, (d, d, n)) .+ R
# Check errors
maximumError = maximum(abs.(PTensor[:] - PVectorised[:]))
println(maximumError)

I thought you could write this, but it turns out to be an error:

julia> @tensor PTensor[a, e, d] := A[a, b] * S[b, c, d] * A[e, c] + R[a,e]
ERROR: TensorOperations.IndexError{String}("add: (:a, :e) to (:a, :e, :d))")

It may be worth opening an issue about that. Usually mul! and friends can multiply and add in one step, although perhaps this is impossible with these dimensions?
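
For an ordinary matrix product, the one-step version is the 5-argument mul! (in recent Julia versions), e.g. for a single slice:

using LinearAlgebra
# Sketch: mul!(C, X, Y, α, β) computes C = α*X*Y + β*C in place, so with
# C initialised to R this fuses the final multiply and the addition:
tmp = A * S[:, :, 1]                 # tmp = A*S₁
C = copy(R)
mul!(C, tmp, ATranspose, 1.0, 1.0)   # C = A*S₁*A' + R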

You can do this, but it’s precisely the .+ line you wrote:

using TensorOperations, TensorCast
@tensor PTensor[a, e, d] := A[a, b] * S[b, c, d] * A[e, c]
@cast PTensor[a, e, d] += R[a,e]

(There’s some chance that you’d need to write out the += explicitly on the tagged version.)

However, baggepinnen’s StaticArrays suggestion is a good one; on first reading I missed that the matrices were as small as 6×6. Not yet mentioned is that you can reinterpret the same array into such slices, via something like reinterpret(typeof(@SMatrix rand(6,6)), vec(PTensor)).
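
For example (a sketch, re-using PTensor from your post above):

using StaticArrays
# Each element of Pslices is a 6×6 SMatrix backed by the memory of PTensor,
# so no copying takes place:
Pslices = reinterpret(SMatrix{6, 6, Float64, 36}, vec(PTensor))
Pslices[1]   # the data of PTensor[:, :, 1], as an SMatrix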

I did follow baggepinnen’s advice and it worked out very well; not only are the matrices small, but the number of matrices is usually pretty low as well. Trying tensor operations was purely out of interest.

Very useful suggestions.

@scj, if you are planning to compute the inverse of S[:, :, j] and/or to use a Kalman smoother, you might also want to store each element of S as a Symmetric matrix. I have implemented something similar in TSAnalysis.jl (https://github.com/fipelle/TSAnalysis.jl/blob/master/src/kalman.jl).
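
For example (a minimal sketch):

using LinearAlgebra
# Wrapping a slice in Symmetric makes inv, cholesky, etc. dispatch to
# symmetry-aware methods, and guards against asymmetry from rounding:
Ssym = [Symmetric(S[:, :, j]) for j in 1:size(S, 3)]
inv(Ssym[1])   # uses a symmetric solver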
