Improve the performance of multiplication of an arbitrary number of matrices

Hi @jebej,

EDITED: I saw an improvement, but it is still slower.

using StaticArrays, BenchmarkTools

D1 = @SMatrix rand(3,3)
D2 = @SMatrix rand(3,3)

@inline function compose_rotation(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::SMatrix{3,3}...)
    result = D2*D1

    for Di in Ds
        result = Di*result
    end

    result
end

test(A::SMatrix{3,3},B::SMatrix{3,3}) = B*A*B*A*A*B*A*B*A

@btime test(D1,D2)
  230.404 ns (13 allocations: 1.02 KiB)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 19.8389  12.0217  9.19594
 14.1823   8.594   6.57394
 16.8129  10.1881  7.79331

@btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  116.769 ns (1 allocation: 80 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 22.8812  29.4466  32.0034
 34.8185  44.8093  48.6999
 25.074   32.2687  35.0704

Currently, allocations happen when there are too many consecutive multiplications/additions. Unfortunately I cannot find the cause at the moment, but as a workaround you can just add some parentheses to get rid of them.

julia> using StaticArrays

julia> D1 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 0.504081  0.829019  0.931518  
 0.84657   0.332624  0.240201  
 0.110669  0.165083  0.00691145

julia> D2 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 0.587738  0.0991184  0.603096
 0.322056  0.107664   0.823367
 0.930769  0.608259   0.871824

julia> @btime $D1*$D2*$D1*$D2*$D1*$D2
  88.841 ns (6 allocations: 480 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 5.57831   2.46902   7.1448  
 3.00079   1.3265    3.8428  
 0.454491  0.200895  0.582224

julia> @btime $D1*$D2*$D1*($D2*$D1*$D2)
  41.833 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 5.57831   2.46902   7.1448  
 3.00079   1.3265    3.8428  
 0.454491  0.200895  0.582224

Thanks! Is it some kind of bug that needs to be reported, or just expected behavior?

An issue already exists. It’s due to an inference limit.


Firstly, always interpolate variables when benchmarking with BenchmarkTools, that is, use $ in front of variables.

Secondly, what was wrong with using @Per’s excellent solution?

@inline compose(D) = D
@inline compose(D, Ds...) = D * compose(Ds...)

It’s much faster, and also cleaner and more elegant than compose_rotation:

julia> @btime compose_rotation($S1, $S2, $S3, $S4, $S3, $S2)
  83.211 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 1.71212  1.49089  1.19398
 1.73657  1.51432  1.21056
 1.25518  1.09087  0.874589

julia> @btime compose($S1, $S2, $S3, $S4, $S3, $S2)
  26.457 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 1.32624  1.1773   0.946564
 1.9006   1.68791  1.36169
 1.51673  1.34766  1.08688

(The difference in result is because you switched the multiplication order of the first two matrices in compose_rotation.)
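For reference, the recursion fully inlines and right-associates the product; a minimal sketch with plain matrices (the same definition works for SMatrix):

```julia
@inline compose(D) = D
@inline compose(D, Ds...) = D * compose(Ds...)

A = rand(3, 3); B = rand(3, 3); C = rand(3, 3)

# compose(A, B, C) inlines to A * (B * compose(C)) == A * (B * C)
compose(A, B, C) == A * (B * C)  # true
```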


Good point! In my first test, the compose solution was slower than the other one. I will modify it here. Thanks :)

Hi @ChrisRackauckas,

Given that, what kind of precautions should I take when multiplying arrays? I have equations with many matrix/vector multiplications.

Is Julia able to perform matrix multiplications like Eigen, avoiding temporary buffers? Or is it more like a series of BLAS library calls?

Hi @Gyslain,

I do not have the technical background to answer that. What I can say is that, using StaticArrays, my C++ code is not much faster than the equivalent Julia code (lots of 3×3 matrix multiplications).

You can avoid temporary buffers, like in Eigen, by broadcasting (i.e. doing operations elementwise) with the dot operator for additions, subtractions, and other “scalar-to-scalar” functions.

For example, for existing matrices A, B, C, and D, the following will traverse each matrix only once and store the result in A.

A .+= cos.(B) .+ 3 .* C .- D ./ 5

For matrix multiplication, you will need a buffer to store the result, just like Eigen in most cases. It is rarely useful to perform matrix multiplication lazily; it is generally faster to store the result in memory for further use. You can use in-place BLAS functions to do the multiplication and avoid superfluous allocations by reusing the buffer if you want.
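As a sketch of that last point (assuming Julia ≥ 0.7, where the in-place multiply is `LinearAlgebra.mul!`; on 0.6 it was `A_mul_B!`):

```julia
using LinearAlgebra  # provides mul! on Julia ≥ 0.7 (A_mul_B! on 0.6)

A = rand(100, 100)
B = rand(100, 100)
buf = similar(A)   # allocate the result buffer once...

# ...then reuse it: computes buf = A*B in place via BLAS, no temporary
mul!(buf, A, B)
buf ≈ A * B        # true
```

Inside a hot loop, reusing `buf` this way amortizes the single allocation across every iteration.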


Easier written (and understood) as

@. A += cos(B) + 3*C - D/5
julia> using StaticArrays, BenchmarkTools

julia> const D1 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 0.745033   0.435679  0.27068  
 0.185017   0.810673  0.310559 
 0.0927197  0.799064  0.0810184

julia> const D2 = @SMatrix rand(3,3)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 0.966228  0.189935  0.967816
 0.637947  0.419533  0.542044
 0.103416  0.477221  0.57544 

julia> @inline function compose_rotation(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::SMatrix{3,3}...)
           result = D2*D1
       
           for Di in Ds
               result = Di*result
           end
       
           result
       end
compose_rotation (generic function with 1 method)

julia> @generated function compose_rotation2(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::Vararg{<:SMatrix{3,3},K}) where K
           quote
               $(Expr(:meta, :inline))
               result = D2*D1
       
               Base.Cartesian.@nexprs $K i -> ( result = Ds[i] * result )
       
               result
           end
       end
compose_rotation2 (generic function with 1 method)

julia> test(A::SMatrix{3,3},B::SMatrix{3,3}) = B*A*B*A*A*B*A*B*A
test (generic function with 1 method)

julia> @btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  73.375 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 8.64171  15.7503  4.83951
 6.57963  11.9922  3.68478
 4.21256   7.6781  2.35919

julia> @btime compose_rotation2(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  13.149 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 8.64171  15.7503  4.83951
 6.57963  11.9922  3.68478
 4.21256   7.6781  2.35919

julia> @btime test(D1, D2)
  435.379 ns (12 allocations: 960 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 8.64171  15.7503  4.83951
 6.57963  11.9922  3.68478
 4.21256   7.6781  2.35919


I am completely astonished by this result! But I really did not understand how it was achieved. Can you explain to me how this generated function works?

Worth pointing out I started Julia with -O3. The default -O2 doesn’t always generate as many SIMD instructions with StaticArrays on Julia 0.6, but I think it does now on 0.7.

@generated function compose_rotation2(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::Vararg{<:SMatrix{3,3},K}) where K
    quote
        $(Expr(:meta, :inline)) # make it inline
        result = D2*D1

        Base.Cartesian.@nexprs $K i -> ( result = Ds[i] * result )

        result
    end
end

Base.Cartesian documentation:
https://docs.julialang.org/en/stable/devdocs/cartesian/

You asked about unrolling in your opening post and no one answered that, so I figured I’d try it.
@nexprs generates n expressions, which we can use to build an unrolled loop. For example:

julia> @macroexpand Base.Cartesian.@nexprs 7 i -> (result = Ds[i] * result)
quote 
    result = Ds[1] * result
    result = Ds[2] * result
    result = Ds[3] * result
    result = Ds[4] * result
    result = Ds[5] * result
    result = Ds[6] * result
    result = Ds[7] * result
end

However, @nexprs needs a literal number of expressions (7), not a symbol or expression (K, where K = 7).
So…

  1. Get the number of expressions we want to generate with the parametric Varargs.
  2. Wrap the macro in a quote, and interpolate the number into the macro.
  3. Stick that quote in a generated function, so the quote gets evaluated and compiled into a function.

The $(Expr(:meta, :inline)) adds the inline compiler hint.
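To make the mechanics concrete, here is a hand-unrolled sketch (plain matrices, and a made-up name `compose_rotation2_K2`) of what the generated body is equivalent to once @nexprs has run with K = 2:

```julia
# Hypothetical illustration: the function below matches what the
# @generated body produces for K = 2, after @nexprs has pasted in
# one assignment per extra matrix argument.
@inline function compose_rotation2_K2(D1, D2, Ds::Vararg{Any,2})
    result = D2 * D1
    result = Ds[1] * result   # expression i = 1 from @nexprs
    result = Ds[2] * result   # expression i = 2 from @nexprs
    result
end

A = rand(3, 3); B = rand(3, 3); C = rand(3, 3); D = rand(3, 3)
compose_rotation2_K2(A, B, C, D) == D * (C * (B * A))  # true
```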

I am surprised by that performance difference though. Often simple for loops are faster than manual unrolling, because LLVM can unroll them itself and add SIMD instructions. Here, though, it must be doing something remarkably clever with the unrolled expression…

julia> @btime compose_rotation2(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  13.156 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 8.64171  15.7503  4.83951
 6.57963  11.9922  3.68478
 4.21256   7.6781  2.35919

julia> @btime D1 * D2
  13.155 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 1.0258    0.453464  1.11297 
 0.728052  0.52345   0.79719 
 0.607728  0.391508  0.569485

Somehow, it figured out how to do 8 multiplications in the time it takes to do 1?

The tests were done with (and -O3):

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD FX(tm)-6300 Six-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Piledriver)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, bdver1)

Also, if they are not constant but interpolated:

julia> D3 = @SMatrix rand(3,3);

julia> D4 = @SMatrix rand(3,3);

julia> @btime compose_rotation2($D3,$D4,$D3,$D4,$D3,$D3,$D4,$D3,$D4)
  65.378 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 1.00311   0.25305    0.326064 
 0.184039  0.0464313  0.0598128
 0.847279  0.213699   0.27551 

This is still an improvement over the roughly 100 ns you would expect from 13 ns/multiplication × 8 multiplications, or over the for-loop version.


This is amazing! Thank you very much, I have learned a lot of things today!

I don’t think this is real. On my machine, I get

julia> @btime compose_rotation2(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  2.462 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 32.8462  36.2317  25.8697
 19.2184  21.1994  15.1361
 48.8581  53.896   38.4747

That is just too fast. I think this has to do with D1 and D2 being consts. I think the following is a better benchmark:

julia> @btime compose_rotation2(D1,D2,D1,D2,D1,D1,D2,D1,D2) setup = begin
           D1 = rand(SMatrix{3, 3})
           D2 = rand(SMatrix{3, 3})
       end
  49.253 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
  5.50704  12.6404   5.48234
 11.153    25.5996  11.103  
  9.06733  20.8124   9.02665

So still good, but not magic.


I often use const instead of interpolating in examples and never encountered this before.
Running a couple of tests, I am pretty sure the problem is combining const with inlining. If you comment out the inline statement and run the three versions on Julia 0.7 (started with -O2) on the same computer:

julia> @btime compose_rotation3(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  74.363 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Float64,2,9}:
 27.1386  46.826   41.1644
 44.2324  76.3201  67.0925
 49.9711  86.222   75.7972

julia> @btime compose_rotation2(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  13.414 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Float64,2,9}:
 27.1386  46.826   41.1644
 44.2324  76.3201  67.0925
 49.9711  86.222   75.7972

julia> @btime compose_rotation(D1,D2,D1,D2,D1,D1,D2,D1,D2)
  132.141 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Float64,2,9}:
 27.1386  46.826   41.1644
 44.2324  76.3201  67.0925
 49.9711  86.222   75.7972

I think that when we inline the function being called with constants into the benchmark, those constant arguments suddenly become compile-time constants, and we basically end up with a version of this:

julia> const c = 3
3

julia> const d = 4
4

julia> f() = c * d
f (generic function with 1 method)

julia> @code_llvm f()

; Function f
; Location: REPL[21]:1
define i64 @julia_f_35391() {
top:
  ret i64 12
}

Still, it is cool that this can happen with the unrolled version of the expression. And that version still seems notably faster than the for-loop variant.