Worth pointing out that I started Julia with -O3. The default -O2 doesn’t always generate as many SIMD instructions with StaticArrays on Julia 0.6, though I think it does now on 0.7.
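If you want to check what your own machine does, one quick way (a sketch, assuming StaticArrays is installed and the REPL was started with julia -O3; the exact instructions you see depend on your CPU) is to look at the native code for a single multiply:

julia> using StaticArrays

julia> A = @SMatrix rand(3,3); B = @SMatrix rand(3,3);

julia> @code_native A * B    # with SIMD you should see packed vector instructions such as vmulpd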
@generated function compose_rotation2(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::Vararg{<:SMatrix{3,3},K}) where K
    quote
        $(Expr(:meta, :inline)) # make it inline
        result = D2*D1
        # @nexprs unrolls this into K separate `result = Ds[i] * result` statements
        Base.Cartesian.@nexprs $K i -> ( result = Ds[i] * result )
        result
    end
end
Base.Cartesian documentation:
https://docs.julialang.org/en/stable/devdocs/cartesian/
You asked about unrolling in your opening post and no one answered that, so I figured I’d try it.
@nexprs generates n expressions, which we can use to make an unrolled loop. For example:
julia> @macroexpand Base.Cartesian.@nexprs 7 i -> (result = Ds[i] * result)
quote
    result = Ds[1] * result
    result = Ds[2] * result
    result = Ds[3] * result
    result = Ds[4] * result
    result = Ds[5] * result
    result = Ds[6] * result
    result = Ds[7] * result
end
However, @nexprs needs a literal number of expressions (7), not a symbol or expression (K, where K = 7).
So…
- Get the number of expressions we want to generate from the parametric Vararg.
- Wrap the macro in a quote, and interpolate that number into the macro call.
- Stick that quote in a generated function, so the quote gets evaluated and compiled into a function body (see the sketch below).
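To make that concrete, here is a rough sketch (hand-written, not compiler output) of what the body ends up as for a call with three trailing matrices, i.e. K = 3: the interpolated $K becomes the literal 3, and @nexprs then expands into three separate statements.

quote
    $(Expr(:meta, :inline))
    result = D2 * D1
    result = Ds[1] * result
    result = Ds[2] * result
    result = Ds[3] * result
    result
end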
The $(Expr(:meta, :inline)) adds the inline compiler hint.
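For an ordinary (non-generated) function, the same hint is just the @inline macro on the definition; for example (a hypothetical two-matrix helper, not from the thread):

@inline compose2(D1::SMatrix{3,3}, D2::SMatrix{3,3}) = D2 * D1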
I am surprised by that performance difference though. Often simple for loops are faster than unrolling, because LLVM can unroll them itself and add SIMD instructions. Here though, it must be doing something remarkably clever with the unrolled expression…
julia> @btime compose_rotation2(D1,D2,D1,D2,D1,D1,D2,D1,D2)
13.156 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
8.64171 15.7503 4.83951
6.57963 11.9922 3.68478
4.21256 7.6781 2.35919
julia> @btime D1 * D2
13.155 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
1.0258 0.453464 1.11297
0.728052 0.52345 0.79719
0.607728 0.391508 0.569485
Somehow, it figured out how to do 8 multiplications in the time it takes to do 1?
The tests were done with -O3 and:
julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD FX(tm)-6300 Six-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Piledriver)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, bdver1)
Also, here are the results when the arguments are not constants, but are interpolated into the benchmark:
julia> D3 = @SMatrix rand(3,3);
julia> D4 = @SMatrix rand(3,3);
julia> @btime compose_rotation2($D3,$D4,$D3,$D4,$D3,$D3,$D4,$D3,$D4)
65.378 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
1.00311 0.25305 0.326064
0.184039 0.0464313 0.0598128
0.847279 0.213699 0.27551
This is still an improvement over the roughly 100 ns you would expect from 13 ns/multiplication * 8 multiplications, and over the for-loop version.
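For reference, the for-loop version I'm comparing against would look something like this (a sketch of the non-generated approach, not the exact function from earlier in the thread):

function compose_rotation_loop(D1::SMatrix{3,3}, D2::SMatrix{3,3}, Ds::SMatrix{3,3}...)
    result = D2 * D1
    for D in Ds        # plain loop; LLVM may or may not unroll and vectorize this on its own
        result = D * result
    end
    result
end

It can be benchmarked the same way, e.g. @btime compose_rotation_loop($D3,$D4,$D3,$D4,$D3,$D3,$D4,$D3,$D4).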