Yeah, I noticed the setindex issue and I’m working on a PR to fix it right now.
Multiplication of two MMatrices actually produces an SMatrix, which is why there’s no allocation and why it’s super fast.
Guys, you have convinced me. I did some tests and, when creating the rotation matrices from Euler angles (which happens a lot during the simulation), my code got 20x (!!!) faster. I think it is now comparable to C++ using something like Eigen. I am changing the code right now.
Thanks for all the help!
EDIT: I am starting to think that StaticArrays should be in Base. It is amazing…
Yes, and it does not work for me. We needed something very close to what we have in Octave and MATLAB to make conversion (and adoption) easier. That is why we decided to create ReferenceFrameRotations (which was called Rotations before, but I did not register the package).
Actually, the first version of our package dates from 2014. At that time, I really do not think there were other toolboxes to do this.
EDIT: Typo
Guys,
I found something interesting, can anyone help me? I really can’t explain it.
Take a look at the following benchmark:
D1 = @SMatrix rand(3,3)
D2 = @SMatrix rand(3,3)
@inline function compose_rotation(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::SMatrix{3,3}...)
    result = D2*D1

    for Di in Ds
        result = Di*result
    end

    result
end
test() = D2*D1*D2*D1*D1*D2*D1*D2*D1
@btime test()
@btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
I am getting:
@btime test()
614.213 ns (14 allocations: 1.45 KiB)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
31.6736 6.31623 21.519
19.4169 3.87201 13.1921
25.5171 5.08857 17.3361
@btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
117.351 ns (1 allocation: 80 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
31.6736 6.31623 21.519
19.4169 3.87201 13.1921
25.5171 5.08857 17.3361
Why is the multiplication using the function 6x faster than the explicit multiplication?
Pass D1 and D2 to the first function as well, otherwise they are non-constant global variables.
Hi @jebej,
EDITED: I saw an improvement, but it is still slower.
D1 = @SMatrix rand(3,3)
D2 = @SMatrix rand(3,3)
@inline function compose_rotation(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::SMatrix{3,3}...)
    result = D2*D1

    for Di in Ds
        result = Di*result
    end

    result
end
test(A::SMatrix{3,3},B::SMatrix{3,3}) = B*A*B*A*A*B*A*B*A
@btime test(D1,D2)
230.404 ns (13 allocations: 1.02 KiB)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
19.8389 12.0217 9.19594
14.1823 8.594 6.57394
16.8129 10.1881 7.79331
@btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
116.769 ns (1 allocation: 80 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
22.8812 29.4466 32.0034
34.8185 44.8093 48.6999
25.074 32.2687 35.0704
Currently, allocations happen for too many consecutive multiplications / additions. Unfortunately I cannot find the issue at the moment, but as a workaround you can just add some parentheses to get rid of them.
julia> using StaticArrays
julia> D1 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
0.504081 0.829019 0.931518
0.84657 0.332624 0.240201
0.110669 0.165083 0.00691145
julia> D2 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
0.587738 0.0991184 0.603096
0.322056 0.107664 0.823367
0.930769 0.608259 0.871824
julia> @btime $D1*$D2*$D1*$D2*$D1*$D2
88.841 ns (6 allocations: 480 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
5.57831 2.46902 7.1448
3.00079 1.3265 3.8428
0.454491 0.200895 0.582224
julia> @btime $D1*$D2*$D1*($D2*$D1*$D2)
41.833 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
5.57831 2.46902 7.1448
3.00079 1.3265 3.8428
0.454491 0.200895 0.582224
Thanks! Is it some kind of bug that needs to be reported or just usual behavior?
An issue already exists. It’s due to an inference limit.
Firstly, always interpolate variables when benchmarking with BenchmarkTools, that is, use $ in front of variables.
Secondly, what was wrong with using @Per’s excellent solution?
@inline compose(D) = D
@inline compose(D, Ds...) = D * compose(Ds...)
It’s much faster, and also cleaner and more elegant than compose_rotation:
julia> @btime compose_rotation($S1, $S2, $S3, $S4, $S3, $S2)
83.211 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
1.71212 1.49089 1.19398
1.73657 1.51432 1.21056
1.25518 1.09087 0.874589
julia> @btime compose($S1, $S2, $S3, $S4, $S3, $S2)
26.457 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
1.32624 1.1773 0.946564
1.9006 1.68791 1.36169
1.51673 1.34766 1.08688
(The difference in result is because you switched the multiplication order of the first two matrices in compose_rotation.)
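To see why the recursive compose unrolls so well, here is a minimal sketch using plain numbers as stand-ins for the matrices (Base Julia only; the numeric values are made up for illustration):

```julia
# Recursive composition: the base case returns its argument, and each
# recursive call peels off one argument. With @inline, the whole chain
# of calls collapses at compile time for a fixed number of arguments.
@inline compose(D) = D
@inline compose(D, Ds...) = D * compose(Ds...)

compose(2, 3, 4)  # expands to 2 * (3 * 4) == 24
```

Because the number of arguments is known at compile time, each method dispatch is resolved statically and there is no runtime loop at all.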
Good point! In my first test, the compose solution was slower than the other one. I will modify it here. Thanks!
Hi @ChrisRackauckas,
Given that, what kind of precautions should I take when multiplying arrays? I have equations with many matrix/vector multiplications.
Is Julia able to perform matrix multiplications like Eigen by avoiding temporary buffers? Or is it more like several BLAS lib calls?
Hi @Gyslain,
I have no technical background to answer you about this. What I can say is that, using StaticArrays, my C++ code is not that much faster compared to the equivalent Julia code (lots of 3×3 matrix multiplications).
You can avoid temporary buffers like in Eigen by broadcasting (a.k.a. doing operations elementwise) with the dot operator: additions, subtractions, and other “scalar-to-scalar” functions.
For example, for already existing matrices A, B, C, and D, the following will traverse each matrix only once and save the result in A:
A .+= cos.(B) .+ 3 .* C .- D ./ 5
For matrix multiplication, you will need a buffer to store the result, just like Eigen in most cases. It is rarely useful to do matrix multiplication lazily, and generally faster to store the result in memory for further use. You can use in-place BLAS functions to do that multiplication and save superfluous allocations by reusing the buffer if you want.
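As a hedged sketch of that buffer reuse, using only the `LinearAlgebra` standard library (the matrix names here are made up):

```julia
using LinearAlgebra

# mul!(C, A, B) stores A*B in the preallocated buffer C instead of
# allocating a fresh result matrix on every call.
A = rand(3, 3)
B = rand(3, 3)
C = similar(A)   # reusable output buffer

mul!(C, A, B)    # C now holds A*B; no new matrix was allocated
```

In a hot loop, the same `C` can be overwritten on every iteration, so the allocation cost is paid only once.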
Easier written (and understood) as
@. A += cos(B) + 3*C - D/5
julia> using StaticArrays, BenchmarkTools
julia> const D1 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
0.745033 0.435679 0.27068
0.185017 0.810673 0.310559
0.0927197 0.799064 0.0810184
julia> const D2 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
0.966228 0.189935 0.967816
0.637947 0.419533 0.542044
0.103416 0.477221 0.57544
julia> @inline function compose_rotation(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::SMatrix{3,3}...)
result = D2*D1
for Di in Ds
result = Di*result
end
result
end
compose_rotation (generic function with 1 method)
julia> @generated function compose_rotation2(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::Vararg{<:SMatrix{3,3},K}) where K
quote
$(Expr(:meta, :inline))
result = D2*D1
Base.Cartesian.@nexprs $K i -> ( result = Ds[i] * result )
result
end
end
compose_rotation2 (generic function with 1 method)
julia> test(A::SMatrix{3,3},B::SMatrix{3,3}) = B*A*B*A*A*B*A*B*A
test (generic function with 1 method)
julia> @btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
73.375 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
8.64171 15.7503 4.83951
6.57963 11.9922 3.68478
4.21256 7.6781 2.35919
julia> @btime compose_rotation2(D1,D2,D1,D2,D1,D1,D2,D1,D2)
13.149 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
8.64171 15.7503 4.83951
6.57963 11.9922 3.68478
4.21256 7.6781 2.35919
julia> @btime test(D1, D2)
435.379 ns (12 allocations: 960 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
8.64171 15.7503 4.83951
6.57963 11.9922 3.68478
4.21256 7.6781 2.35919
I am completely astonished by this result! But I really did not understand how it was achieved. Can you explain to me how this generated function works?
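One way to see the mechanism (a simplified sketch over plain numbers, not the StaticArrays internals): `@generated` functions receive the *types* of their arguments and return an expression to be compiled, so the number of trailing arguments `K` is available as a compile-time constant. `Base.Cartesian.@nexprs` then emits `K` explicit statements instead of a runtime loop:

```julia
# A generated function whose body is built from the argument types.
# K, the number of trailing arguments, is known at compile time, so
# @nexprs can unroll the accumulation into K separate statements.
@generated function prod_unrolled(x, ys::Vararg{Any,K}) where K
    quote
        result = x
        Base.Cartesian.@nexprs $K i -> (result = ys[i] * result)
        result
    end
end

# For K = 2, the body compiles to:
#   result = x; result = ys[1] * result; result = ys[2] * result
prod_unrolled(2, 3, 4)  # 4 * (3 * 2) == 24
```

Since the unrolled body has no loop and no branching, the compiler can keep every intermediate in registers, which is why `compose_rotation2` above shows no allocations.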