StaticArrays:construct a SMatrix from three vectors

question

#1

I’m using SMatrix from StaticArrays.jl for calculation of coordinate system transformation. In the hotspot, I’m running following (simplified) code:

using StaticArrays, BenchmarkTools
a =  rand(3)
b =  rand(3)
c =  rand(3)
d =  rand(3)
m = hcat((a-d),(b-d),(c-d))
sm = SMatrix{3,3}(hcat((a-d),(b-d),(c-d))...)
m==sm

f1(a,b,c,d) = SMatrix{3,3}(hcat((a-d),(b-d),(c-d))...)
f2(a,b,c,d) = SMatrix{3,3}((a-d)...,(b-d)...,(c-d)...)

@inline function smatrix33(a,b,c)
    @inbounds A = SMatrix{3,3}(a[1], a[2], a[3], b[1], b[2], b[3], c[1], c[2], c[3])
    A
end
f3(a,b,c,d) = smatrix33((a-d),(b-d),(c-d))

The result medians are for f1, f2 and f3 1.2 us, 780 ns and 174 ns.

Is there a better way to fill an SMatrix by column / row vectors, or is it necessary to write a helper function to get the full speed?


#2
julia> A = @SMatrix rand(3, 4)
3×4 StaticArrays.SArray{Tuple{3,4},Float64,2,12}:
 0.644328  0.225848  0.36689   0.243095
 0.590968  0.462177  0.206045  0.404283
 0.973567  0.342697  0.65298   0.901137

julia> f4(A) = A[:, SVector(1, 2, 3)] .- A[:, 4]

julia> @btime f4($A)
  10.637 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 0.401233   -0.0172464   0.123795
 0.186685    0.057894   -0.198239
 0.0724302  -0.55844    -0.248157

#3

I know the arithmetic operations between static arrays are amazingly fast. My concern is, how to effectively construct such a SMatrix out of plain vectors (arrays).

Actually if a,b,c,d are SVector already, then just using hcat seems the fastest approach. Got 100 ns for

a =  @SVector rand(3)
b =  @SVector rand(3)
c =  @SVector rand(3)
d =  @SVector rand(3)
f5(a,b,c,d) = hcat((a-d),(b-d),(c-d))

correctly returning a 3x3 SArray.

I didn’t know there is an optimized method for hcatting SVectors but apparently there is. I really do appreciate Matt Bauman’s work on this package.

But with plain arrays, splatting is too slow and the explicit approach of smatrix33 is the winner so far.


#4
a = rand(3)
b = rand(3)
c = rand(3)
d = rand(3)

function f6(a, b, c, d)
    as = SVector{3}(a)
    bs = SVector{3}(b)
    cs = SVector{3}(c)
    ds = SVector{3}(d)
    return hcat(as-ds, bs-ds, cs-ds)
end

julia> @btime f6($a, $b, $c, $d)
  17.349 ns (0 allocations: 0 bytes)
3×3 StaticArrays.SArray{Tuple{3,3},Float64,2,9}:
 -0.543394   -0.545152  -0.275096 
  0.0925233   0.778087  -0.0803328
 -0.065766    0.832741   0.527109 

#5

Yes, this is the fastest and easiest way to accomplish this.

Thank you, @DNF, I’ve marked the correct answer. Its 30ns in my case, but that’s on a laptop…


#6

My result is on a laptop too, four years old. Perhaps I’m getting SIMD vectorization, or something. Have you tried starting julia with the O3 option like this?

$ julia -03


#7

Oh, and did you time it exactly like this:

julia> using BenchmarkTools
julia> @btime f6($a, $b, $c, $d)

?
If I remove the $ signs I get ~40ns.


#8

Thank you for pointing me to the $ use - I usually don’t forget it to use this for more expensive arguments, but here I forgot. Now with -O3 and $ I’m at 13 ns… on an i5-6300U 2.4GHz.

Still I doubt if SIMD was used. Do you know what to look for in the @code_llvm or @code_native output? Even for such an easy function as f6, I get lost in the output… Is there a specific instruction indicating a SIMD parallel subtract or parallel move?

Just by chance, do you know how to pass the -O3 option to Julia running as Jupyter backend?


#9

You can look for e.g. fadd <4 x double> %6, %10, basically anything that involves a <q x type> where q is usually 2, or 4, and type is the LLVM type.


#10

Thank you, big help.
Petr