Improve the performance of multiplication of an arbitrary number of matrices


#1

Hi guys!

I need to make a function that multiplies an arbitrary number of matrices. My current attempt is the following:

@inline function compose_rotation(D1::Matrix, Ds::Matrix...)
    for Di in Ds
        D1 = Di*D1
    end

    D1
end

The problem is that the performance is much worse than only the multiplication:

D1 = rand(3,3)
D2 = rand(3,3)

@time for i=1:10000000
    compose_rotation(D1,D2,D1,D2)
end

2.406881 seconds (30.00 M allocations: 4.470 GiB, 1.90% gc time)

@time for i=1:10000000
    D1*D2*D1*D2
end

2.004193 seconds (30.00 M allocations: 4.470 GiB, 2.44% gc time)

I am wondering: since I will always know at compile time how many matrices need to be multiplied, is there any way to “unroll” the loop so that I can improve the performance?

P.S.: This seems a little silly, but this function will be used to compute a composed rotation regardless of the choice of the rotation description (matrix, quaternion, Euler axis/angle, etc.).


#2

  1. You’re timing in global scope, so your results are not going to be informative. For a discussion of the same issue from earlier today, see: How to assign real/imaginary part of complex array

  2. You’ll get substantially better performance if you use https://github.com/JuliaArrays/StaticArrays.jl for your small fixed-size arrays.
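A minimal sketch of point 1, using only Base (the names here are illustrative):

```julia
D1 = rand(3, 3)
D2 = rand(3, 3)

# In global scope, D1 and D2 are untyped, non-constant globals, so every
# iteration pays for dynamic dispatch. Wrapping the work in a function
# lets the compiler specialize on the concrete argument types.
function bench(A, B, n)
    R = A
    for _ in 1:n
        R = A * B * A * B
    end
    return R
end

bench(D1, D2, 1)                      # warm up (first call compiles)
t = @elapsed bench(D1, D2, 100_000)   # time only the specialized code
```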


#3

You also might be interested in https://github.com/FugroRoames/CoordinateTransformations.jl which already supports compositions of transformations and has been stable and performant for quite a while.


#4

Thanks!

But this is actually for my package:

I am almost finished with a toolbox of functions related to satellite simulations. In this toolbox, there are a lot of functions that create rotations between reference frames (J2000, GCRF, PEF, TOD, MOD, TEME, etc.). I want each function to be able to compute the rotation using either quaternions or DCMs. If I do not have this compose_rotation function, then I will have to create two functions for each rotation (which is a lot, really…).

I will try to benchmark inside a function and will post the results here!


#5

I’m not sure that benchmarking in a function will make a big difference in this specific case (although you should never benchmark in the global scope, and you should definitely check out BenchmarkTools; I never benchmark with anything else).

What will make a difference is using StaticArrays, you’ll get 5-10x speedup, probably, at least if you avoid the splatting and the loop. You can unroll the loop using @generated functions.

But, question: If you always know the number of composed rotations at compile time, can you just avoid the whole compose function and type out the multiplication inline in your code, that is, just put D1 * D2 * D3 * D4 in there?
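For reference, one way to unroll with a @generated function is sketched below (the function name is mine; as later posts note, in practice this approach did not beat the simpler alternatives):

```julia
# Builds the expression Ds[N] * (… * (Ds[2] * Ds[1])) at compile time,
# so the runtime code is a fully unrolled chain of multiplications with
# no loop and no splatting overhead.
@generated function compose_gen(Ds::Vararg{AbstractMatrix,N}) where {N}
    expr = :(Ds[1])
    for i in 2:N
        expr = :(Ds[$i] * $expr)
    end
    return expr
end
```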


#6

BTW, for reasons that are unclear to me, I get significantly improved performance if, in addition to using StaticArrays, I add parens, like this:

((D1 * D2) * D3) * D4

#7
using StaticArrays
using BenchmarkTools

function composeN(D1, Ds...)
    for D in Ds
        D1 = D1 * D
    end
    return D1
end

compose4(D1, D2, D3, D4) = D1 * D2 * D3 * D4
compose4_(D1, D2, D3, D4) = ((D1 * D2) * D3) * D4
@inline compose4i(D1, D2, D3, D4) = ((D1 * D2) * D3) * D4

D1 = rand(3, 3)
D2 = rand(3, 3)
D3 = rand(3, 3)
D4 = rand(3, 3)

S1 = @SMatrix rand(3, 3)
S2 = @SMatrix rand(3, 3)
S3 = @SMatrix rand(3, 3)
S4 = @SMatrix rand(3, 3)

@btime composeN($D1, $D2, $D3, $D4)
@btime compose4($D1, $D2, $D3, $D4)
@btime compose4_($D1, $D2, $D3, $D4)
@btime compose4i($D1, $D2, $D3, $D4)

@btime composeN($S1, $S2, $S3, $S4)
@btime compose4($S1, $S2, $S3, $S4)
@btime compose4_($S1, $S2, $S3, $S4)
@btime compose4i($S1, $S2, $S3, $S4)

Output:

225.324 ns (3 allocations: 480 bytes)
200.666 ns (3 allocations: 480 bytes)
203.422 ns (3 allocations: 480 bytes)
201.090 ns (3 allocations: 480 bytes)
 
84.016 ns (5 allocations: 400 bytes)
33.937 ns (0 allocations: 0 bytes)
23.829 ns (0 allocations: 0 bytes)
14.082 ns (0 allocations: 0 bytes)

Not sure why @inline and the parens make a difference, but I would assume that when used in other code, compose4 would get inlined anyway.

BTW: Note that if you divide by the number of times you are looping over this in your code (1:10_000_000) the timing and memory use is similar to what you get with BenchmarkTools, which means that global scope vs function is not making a big difference in this specific case.


#8

Hi @DNF,

Thanks for the tips! I need to create a function for two reasons:

  1. The compiler will know the number of matrices, but it can be any, depending on the usage.
  2. I would like to have 1 function to compose rotations for different representations. For example, let’s say that you have three rotations R1, R2, and R3. If you use DCMs, then the composition is R3*R2*R1. However, if you are using quaternions, then it is R1*R2*R3. So, I would like one specialized function to make this kind of operation.

#9

To get the speed of compose4i for a variable number of arguments, you could try:

@inline composeNi(D) = D
@inline composeNi(D, Ds...) = D*composeNi(Ds...)

#10

That is a super simple solution, and is equivalent to

D1 * (D2 * (D3 * D4))

which, oddly, is ever so slightly slower than

((D1 * D2) * D3) * D4
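Both groupings compute the same product, since matrix multiplication is associative; only the evaluation order (and, apparently, the generated code) differs:

```julia
A, B, C, D = rand(3, 3), rand(3, 3), rand(3, 3), rand(3, 3)

left  = ((A * B) * C) * D   # what the parenthesized version evaluates
right = A * (B * (C * D))   # what the recursive definition expands to
left ≈ right                # same result, up to floating-point rounding
```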

I tried to achieve something similar with @generated functions, but that was a disaster, 5x as slow and with allocations, even though the generated code looked like it should do the exact same thing.


#11

Hi @Per!

Thanks for the tip. However, I found the optimal configuration was the following:

@inline compose_rotation(D1::Matrix, D2::Matrix) = D2*D1
@inline compose_rotation(D1::Matrix, D2::Matrix, D3::Matrix) = D3*D2*D1
@inline compose_rotation(D1::Matrix, D2::Matrix, D3::Matrix, D4::Matrix) =
    D4*D3*D2*D1
@inline compose_rotation(D1::Matrix,
                         D2::Matrix,
                         D3::Matrix,
                         D4::Matrix,
                         D5::Matrix) =
    D5*D4*D3*D2*D1

@inline function compose_rotation(D1::Matrix, D2::Matrix, Ds::Matrix...)
    result = D2*D1

    for Di in Ds
        result = Di*result
    end

    result
end

Using your suggestion for 6 matrices, I got an average execution time of 581 ns, whereas I measured 430 ns with the above code.

For some unknown reason, here the multiplication without the parentheses is faster than with them, by almost 10%.

I also tried with @generated functions and saw something like 7 or 8x slower! Do you mind sharing your solution?

By the way, I am a little reluctant to change everything to immutable StaticArrays. This code is being used in my institute by people who are moving from MATLAB (or who want better performance). The fact that I can’t change an element after the definition could cause problems.

Thanks!


#12

It’s true that working with immutable static matrices can be less convenient, but the performance improvements can be huge. For example:

julia> x = rand(3, 3);

julia> smatrix_x = SMatrix{3, 3}(x);

julia> @btime $x * $x;
  54.556 ns (1 allocation: 160 bytes)

julia> @btime $smatrix_x * $smatrix_x;
  7.940 ns (0 allocations: 0 bytes)

The static matrix version is more than 6 times faster!

If you really need mutability, there’s also the MMatrix type from StaticArrays, which is fixed-size but mutable. It’s generally not as fast as the immutable version, but it can be useful as well. For this simple multiplication benchmark, it’s actually just as fast:

julia> mmatrix_x = MMatrix{3, 3}(x);

julia> @btime $mmatrix_x * $mmatrix_x;
  7.940 ns (0 allocations: 0 bytes)

#13

On my pc, composing four matrices is fifteen times as fast with SMatrix or MMatrix as with ordinary Arrays. That’s a lot of performance to leave on the table.

It is also supposed to be possible to “mutate” an SMatrix, except it seems not to be working right now.

BTW, I wonder why SMatrix and MMatrix now seem to be equally fast.


#14

Yeah, I noticed the setindex issue and I’m working on a PR to fix it right now :slight_smile:

Multiplication of two MMatrices actually produces an SMatrix, which is why there’s no allocation and why it’s super fast.


#15

Guys, you have convinced me. I did some tests and when I am creating the rotation matrices from Euler Angles (which is a lot of work during the simulation), I got a code 20x (!!!) faster. I think it started to be comparable to C++ using something like Eigen. I am changing the code right now :slight_smile:

Thanks for all the help!

EDIT: I am starting to think that StaticArrays should be in Base. It is amazing…

#17

Are you aware of https://github.com/FugroRoames/Rotations.jl?


#18

Yes, and it does not work for me. We needed something very very close to what we have in octave and MATLAB to make conversion (and adoption) easier. That is why we decided to create ReferenceFrameRotations (which was called Rotations before, but I did not register the package).

Actually, the first version of our package was from 2014. By that time, I really do not think that there were other toolboxes to do this.

EDIT: Typo


#19

Guys,

I found something interesting, can anyone help me? I really can’t explain.

Take a look at the following benchmark:

D1 = @SMatrix rand(3,3)
D2 = @SMatrix rand(3,3)

@inline function compose_rotation(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::SMatrix{3,3}...)
    result = D2*D1

    for Di in Ds
        result = Di*result
    end

    result
end

test() = D2*D1*D2*D1*D1*D2*D1*D2*D1

@btime test()
@btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)

I am getting:

@btime test()
  614.213 ns (14 allocations: 1.45 KiB)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 31.6736  6.31623  21.519
 19.4169  3.87201  13.1921
 25.5171  5.08857  17.3361

@btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  117.351 ns (1 allocation: 80 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 31.6736  6.31623  21.519
 19.4169  3.87201  13.1921
 25.5171  5.08857  17.3361

Why is the multiplication using the function 6x faster than the explicit multiplication?


#20

Pass D1 and D2 to the test function as well; otherwise they are accessed as non-constant global variables, and the compiler cannot specialize on their types.
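Concretely, the fix looks like this (sketch; the name test2 is mine):

```julia
# Accept the matrices as arguments so the compiler sees concrete types,
# instead of closing over non-constant globals as test() does.
test2(A, B) = B*A*B*A*A*B*A*B*A

# Then benchmark with interpolated arguments:
#   @btime test2($D1, $D2)
```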