Improve the performance of multiplication of an arbitrary number of matrices

Yeah, I noticed the setindex issue and I’m working on a PR to fix it right now :slight_smile:

Multiplication of two MMatrices actually produces an SMatrix, which is why there’s no allocation and why it’s super fast.

1 Like

Guys, you have convinced me. I did some tests and when I am creating the rotation matrices from Euler Angles (which is a lot of work during the simulation), I got a code 20x (!!!) faster. I think it started to be comparable to C++ using something like Eigen. I am changing the code right now :slight_smile:

Thanks for all the help!

EDIT: I am starting to think that StaticArrays should be in Base. It is amazing…

4 Likes

https://github.com/JuliaArrays/StaticArrays.jl/pull/402

2 Likes

Are you aware of https://github.com/FugroRoames/Rotations.jl?

Yes, and it does not work for me. We needed something very very close to what we have in octave and MATLAB to make conversion (and adoption) easier. That is why we decided to create ReferenceFrameRotations (which was called Rotations before, but I did not register the package).

Actually, the first version of our package was from 2014. By that time, I really do not think that there were other toolboxes to do this.

EDIT: Typo

Guys,

I found something interesting, can anyone help me? I really can’t explain.

Take a look at the following benchmark:

D1 = @SMatrix rand(3,3)
D2 = @SMatrix rand(3,3)

@inline function compose_rotation(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::SMatrix{3,3}...)
    result = D2*D1

    for Di in Ds
        result = Di*result
    end

    result
end

test() = D2*D1*D2*D1*D1*D2*D1*D2*D1

@btime test()
@btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)

I am getting:

@btime test()
  614.213 ns (14 allocations: 1.45 KiB)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 31.6736  6.31623  21.519
 19.4169  3.87201  13.1921
 25.5171  5.08857  17.3361

@btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  117.351 ns (1 allocation: 80 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 31.6736  6.31623  21.519
 19.4169  3.87201  13.1921
 25.5171  5.08857  17.3361

Why the multiplication using the function is 6x faster than the explicit multiplication?

Pass D1 and D2 to the first function as well, otherwise they are non-constant global variables.

1 Like

Hi @jebej,

EDITED: I saw an improvement, but it is still slower.

D1 = @SMatrix rand(3,3)
D2 = @SMatrix rand(3,3)

@inline function compose_rotation(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::SMatrix{3,3}...)
    result = D2*D1

    for Di in Ds
        result = Di*result
    end

    result
end

test(A::SMatrix{3,3},B::SMatrix{3,3}) = B*A*B*A*A*B*A*B*A

@btime test(D1,D2)
  230.404 ns (13 allocations: 1.02 KiB)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 19.8389  12.0217  9.19594
 14.1823   8.594   6.57394
 16.8129  10.1881  7.79331

@btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  116.769 ns (1 allocation: 80 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 22.8812  29.4466  32.0034
 34.8185  44.8093  48.6999
 25.074   32.2687  35.0704

Currently there are allocations happening for too many consecutive multiplications / additions. Unfortunately I cannot find the issue at the moment, but as a workaround you can just add some parens to get rid of this.

julia> using StaticArrays

julia> D1 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 0.504081  0.829019  0.931518  
 0.84657   0.332624  0.240201  
 0.110669  0.165083  0.00691145

julia> D2 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 0.587738  0.0991184  0.603096
 0.322056  0.107664   0.823367
 0.930769  0.608259   0.871824

julia> @btime $D1*$D2*$D1*$D2*$D1*$D2
using  88.841 ns (6 allocations: 480 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 5.57831   2.46902   7.1448  
 3.00079   1.3265    3.8428  
 0.454491  0.200895  0.582224

julia> @btime $D1*$D2*$D1*($D2*$D1*$D2)
  41.833 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 5.57831   2.46902   7.1448  
 3.00079   1.3265    3.8428  
 0.454491  0.200895  0.582224
1 Like

Thanks! Is it some kind of bug that needs to be reported or just usual behavior?

An issue already exists. It’s due to an inference limit.

1 Like

Firstly, always interpolate variables when benchmarking with BenchmarkTools, that is, use $ in front of variables.

Secondly, what was wrong with using @Per’s excellent solution?

@inline compose(D) = D
@inline compose(D, Ds...) = D * compose(Ds...)

It’s much faster, and also cleaner and more elegant than compose_rotation:

julia> @btime compose_rotation($S1, $S2, $S3, $S4, $S3, $S2)
  83.211 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 1.71212  1.49089  1.19398
 1.73657  1.51432  1.21056
 1.25518  1.09087  0.874589

julia> @btime compose($S1, $S2, $S3, $S4, $S3, $S2)
  26.457 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 1.32624  1.1773   0.946564
 1.9006   1.68791  1.36169
 1.51673  1.34766  1.08688

(The difference in result is because you switched the multiplication order of the first two matrices in compose_rotation.)

1 Like

Good point! In my first test, the solution compose was slower than the other one. I will modify it here. Thanks :slight_smile:

Hi @ChrisRackauckas,

Given that, what kind of precautions should I have when multiplying arrays? I have equations with many matrices / vectors multiplications.

Is Julia able to perform matrix multiplications like Eigen by avoiding temporary buffers? Or is it more like several BLAS lib calls?

Hi @Gyslain,

I have no technical background to answer you about this. What I can say is that, using StaticArrays, my C++ codes are not that faster comparable to the same julia codes (lots of 3x3 matrices multiplications).

You can avoid temporary buffers like in eigen by broadcasting (a.k.a doing operations elementwise) with the dot operator additions, subtractions, and other “scalar-to-scalar” functions.

For example, for already existing matrices A,B,C and D, the following will traverse each matrix only once, and save the result in A.

A .+= cos.(B) .+ 3.*C .- D./5

For matrix multiplication, you will need a buffer to store the result, just like eigen in most cases. It is rarely useful to lazily do matrix multiplication, and generally faster to store the result in memory for further use. You can use in-place BLAS functions to do that multiplication and save superfluous allocations by reusing the buffer if you want.

1 Like

Easier written (and understood) as

@. A += cos(B) + 3*C - D/5
4 Likes
julia> using StaticArrays, BenchmarkTools

julia> const D1 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 0.745033   0.435679  0.27068  
 0.185017   0.810673  0.310559 
 0.0927197  0.799064  0.0810184

julia> const D2 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 0.966228  0.189935  0.967816
 0.637947  0.419533  0.542044
 0.103416  0.477221  0.57544 

julia> @inline function compose_rotation(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::SMatrix{3,3}...)
           result = D2*D1
       
           for Di in Ds
               result = Di*result
           end
       
           result
       end
compose_rotation (generic function with 1 method)

julia> @generated function compose_rotation2(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::Vararg{<:SMatrix{3,3},K}) where K
           quote
               $(Expr(:meta, :inline))
               result = D2*D1
       
               Base.Cartesian.@nexprs $K i -> ( result = Ds[i] * result )
       
               result
           end
       end
compose_rotation2 (generic function with 1 method)

julia> test(A::SMatrix{3,3},B::SMatrix{3,3}) = B*A*B*A*A*B*A*B*A
test (generic function with 1 method)

julia> @btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  73.375 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 8.64171  15.7503  4.83951
 6.57963  11.9922  3.68478
 4.21256   7.6781  2.35919

julia> @btime compose_rotation2(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  13.149 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 8.64171  15.7503  4.83951
 6.57963  11.9922  3.68478
 4.21256   7.6781  2.35919

julia> @btime test(D1, D2)
  435.379 ns (12 allocations: 960 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 8.64171  15.7503  4.83951
 6.57963  11.9922  3.68478
 4.21256   7.6781  2.35919

1 Like

I am completely astonished with this result! But I really did not understand how this was achieved :blush: can you explain to me how this generated function works?