 # Recent developments in StaticArrays.jl land

StaticArrays.jl has received a few cool new features since its 0.12.4 release, and they are now fast thanks to the memory-layout changes in Julia 1.5. These changes are commonly known as non-allocating views, but as I show here, that's just the tip of the iceberg.

Here is the outline to make this post a bit more digestible:

1. A cool application.
2. Sized abstract arrays.
3. Faster matrix multiplication.

These things may not be easy to find, so I decided to write about them here.

## 1. A cool application

Diffusion tensor MRI produces arrays of 3×3 symmetric positive definite (SPD) matrices. Analyzing such data is one possible application of Manifolds.jl. There are multiple operations that can be performed on such data, and one of them is the so-called logarithmic map in the linear-affine metric. It takes two N×N SPD matrices and returns a vector that describes how to get from the first one to the second one. Here is the code that performs this operation:

```julia
function log!(M::SymmetricPositiveDefinite{N}, X, p, q) where {N}
    e = eigen(Symmetric(p))
    U = e.vectors
    S = e.values
    Ssqrt = Diagonal(sqrt.(S))
    SsqrtInv = Diagonal(1 ./ sqrt.(S))
    pSqrt = Symmetric(U * Ssqrt * transpose(U))
    pSqrtInv = Symmetric(U * SsqrtInv * transpose(U))
    T = Symmetric(pSqrtInv * q * pSqrtInv)
    e2 = eigen(T)
    Se = Diagonal(log.(max.(e2.values, eps())))
    pUe = pSqrt * e2.vectors
    return mul!(X, pUe, Se * transpose(pUe))
end
```

Now we may want to compute that for each pair of corresponding matrices from the two arrays we get from DT-MRI. How many allocations would that take? Just one, for the result. Now, how is that possible with so much linear algebra going on? Let’s take a look.

First, we need to somehow tell StaticArrays that we are dealing with static 3×3 matrices, but a lot of them. Traditionally one would make, for example, a 128×128 array of 3×3 `SArray`s, but in my experience that's very inconvenient to work with. Instead, we can use HybridArrays.jl to make a 3×3×128×128 array with the first two axes statically sized and the other two dynamically sized. With that we can easily access the 3×3 parts and get just a normal `SMatrix`. That covers reading from it, but we also need to write the result somehow. That's where sized abstract arrays come in.

## 2. Sized abstract arrays

`SizedArray` was generalized to support adding size information to arbitrary `AbstractArray`s. For example you can make a sized view. How? Just call `view`! As long as the size can be figured out from types, the result will be a `SubArray` wrapped in `SizedArray`. And that won’t allocate on Julia 1.5+.

```julia
julia> using HybridArrays, StaticArrays, BenchmarkTools

julia> x = HybridArray{Tuple{3,3,StaticArrays.Dynamic(),StaticArrays.Dynamic()}}(randn(3,3,128,128));

julia> xv = view(x, :, :, 1, 2)
3×3 SizedArray{Tuple{3,3},Float64,2,2,SubArray{Float64,2,HybridArray{Tuple{3,3,StaticArrays.Dynamic(),StaticArrays.Dynamic()},Float64,4,4,Array{Float64,4}},Tuple{Base.Slice{SOneTo{3}},Base.Slice{SOneTo{3}},Int64,Int64},true}} with indices SOneTo(3)×SOneTo(3):
  0.453334  -0.202884  -0.977482
 -0.311551   0.324371  -0.422867
  1.24581    1.56688    0.0744785

julia> xv[3, 3] = 12.0
12.0

julia> x[3, 3, 1, 2]
12.0
```
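To tie this back to section 1: with such a hybrid array, a driver loop applying an in-place kernel like `log!` to every 3×3 slice pair could look like the following sketch (the `map_slices!` name and the kernel signature `f!(X, p, q)` are my own, not part of any package):

```julia
using HybridArrays, StaticArrays

# Sketch: apply an in-place kernel f!(X, p, q) to every 3×3 slice pair.
# Reading P[:, :, i, j] yields an SMatrix (the first two axes are static);
# view(X, :, :, i, j) yields a sized view, a non-allocating write target.
function map_slices!(f!, X, P, Q)
    for j in axes(P, 4), i in axes(P, 3)
        f!(view(X, :, :, i, j), P[:, :, i, j], Q[:, :, i, j])
    end
    return X
end
```

On Julia 1.5+ the sized views make the writes allocation-free, so the whole loop allocates nothing beyond the preallocated output array.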

You can make anything sized; I've even managed to get a GPU kernel in CUDA.jl to broadcast over `SizedArray` views of parts of `CuArray`s.

## 3. Faster matrix multiplication

Sized views are, however, not all that was required to make that `log!` method non-allocating. The other part is multiplication with `Symmetric` matrices, which allocates on StaticArrays 0.12.4 because it falls back to a generic `LinearAlgebra` implementation; that release has special implementations only for `Adjoint`, `Transpose` and triangular matrices. StaticArrays master improves this, and now you can freely wrap your arrays in `Symmetric`, `Hermitian`, `Adjoint`, triangular matrix wrappers, etc.
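As an illustration, a product through a `Symmetric` wrapper of a static matrix looks like this (a sketch; on StaticArrays 0.12.4 this still works but goes through the allocating generic fallback, while master keeps it on the fast static path):

```julia
using StaticArrays, LinearAlgebra

A = @SMatrix randn(3, 3)
S = Symmetric(A' * A)   # wrap without copying; A'A is symmetric
B = @SMatrix randn(3, 3)
C = S * B               # on master: statically sized, non-allocating
```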

Thanks to @Elrod, multiplication of statically sized matrices now also uses `muladd`, which makes it even faster.

The excellent work on reducing invalidations significantly helped StaticArrays, but it now turns out that its loading time is dominated by the insertion of new methods into existing functions, which is a different problem. For example, on my computer, slightly restricting the signature of a 5-argument `mul!` method decreased loading time by 0.35s. These restricted cases never even worked on `StaticArray`s, and no one had filed an issue about them.

Depending on how much of this structured matrix multiplication is removed, I get loading times on a recent Julia nightly between 0.4s and 0.8s. However, in contrast with the other improvements mentioned here, which are already merged to master, there is no consensus yet on what to do about it.

As always, there are also other improvements to existing methods, and new methods being implemented. The next major release of StaticArrays.jl will be quite cool.


Thank you @mateuszbaran for the nice updates!

I wonder if you could share some comments about the following behavior. Could it also be improved in future releases?

```julia
julia> A = @SMatrix rand(3,3)
3×3 SArray{Tuple{3,3},Float64,2,9} with indices SOneTo(3)×SOneTo(3):
 0.797002  0.00322277  0.67059
 0.430339  0.115212    0.525332
 0.674385  0.126444    0.417627

julia> A[1:2,1:2]
2×2 Array{Float64,2}:
 0.797002  0.00322277
 0.430339  0.115212
```

It seems that slicing static arrays breaks the static property of the array.

The non-allocating views in Julia v1.5 really can make a huge difference. By the way, you can also view flat arrays as arrays of static arrays using ArraysOfArrays.jl:

```julia
julia> using ArraysOfArrays, StaticArrays

julia> A = nestedview(rand(2, 100), SVector{2});

julia> A[1]
2-element SArray{Tuple{2},Float64,1,2} with indices SOneTo(2):
 0.5341239725077629
 0.24394153580924982
```

This is now allocation-free; no data is copied (except to the stack).


Thanks!

There was some discussion about improving constant propagation of non-static indices: https://github.com/JuliaArrays/StaticArrays.jl/pull/703 . The brief summary (as I understand it) is that as you consider more and more complex examples of such indexing, it becomes clear that it has to break at some point, and it's not clear that pushing that point a bit further is a good idea. For now you can just use `SOneTo`: `A[SOneTo(2), SOneTo(2)]`.
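For example, indexing with `SOneTo` keeps the result statically sized (a quick sketch):

```julia
using StaticArrays

A = @SMatrix rand(3, 3)
# SOneTo encodes the slice length in the type, so the result
# is a statically sized 2×2 matrix rather than a plain Array.
B = A[SOneTo(2), SOneTo(2)]
```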


I see that there is also a `StaticArrays.SUnitRange` construct:

```julia
julia> A = @SMatrix rand(4, 5);

julia> v = StaticArrays.SUnitRange(2,3);

julia> A[v, v]
2×2 SArray{Tuple{2,2},Float64,2,4} with indices SOneTo(2)×SOneTo(2):
 0.802085  0.307546
 0.917425  0.419405
```

and it even works for ordinary arrays:

```julia
julia> B = rand(4, 5);

julia> B[v, v]
2×2 SArray{Tuple{2,2},Float64,2,4} with indices SOneTo(2)×SOneTo(2):
 0.205907  0.981022
 0.863682  0.717917
```

Could there be a ‘static index macro’ that rewrote:

```julia
@SIndex B[2:3, 1:4]
```

into

```julia
julia> B[SUnitRange(2,3), SUnitRange(1,4)]
```

?
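No such macro exists in StaticArrays, but a minimal sketch (handling only literal `a:b` ranges; `@SIndex` is a hypothetical name) might look like:

```julia
using StaticArrays

# Hypothetical macro: rewrite literal a:b ranges in an indexing
# expression into StaticArrays.SUnitRange so the result is statically sized.
macro SIndex(ex)
    ex isa Expr && ex.head === :ref ||
        error("@SIndex expects an indexing expression like B[2:3, 1:4]")
    args = map(ex.args[2:end]) do ix
        if ix isa Expr && ix.head === :call && length(ix.args) == 3 &&
           ix.args[1] === :(:) && ix.args[2] isa Integer && ix.args[3] isa Integer
            :(StaticArrays.SUnitRange($(ix.args[2]), $(ix.args[3])))
        else
            ix  # leave other indices (integers, variables) untouched
        end
    end
    return esc(Expr(:ref, ex.args[1], args...))
end
```

With that, `@SIndex B[2:3, 1:4]` expands to `B[StaticArrays.SUnitRange(2, 3), StaticArrays.SUnitRange(1, 4)]` (assuming `StaticArrays` is loaded in the calling scope).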

@mateuszbaran my use case is actually this one:

```julia
julia> struct Foo{N} end

julia> function Foo{N}() where N
           A = @SMatrix rand(10,10)
           return A[1:N,1:N]
       end

julia> Foo{2}()
2×2 Array{Float64,2}:
 0.849248  0.495188
 0.997704  0.58433
```

Do I need to play with `Val{N}` somewhere?

```julia
julia> function Foo{N}() where N
           A = @SMatrix rand(10,10)
           return A[SOneTo(N),SOneTo(N)]
       end

julia> Foo{3}()
3×3 SArray{Tuple{3,3},Float64,2,9} with indices SOneTo(3)×SOneTo(3):
 0.594725  0.849064  0.525443
 0.6584    0.313973  0.723844
 0.25742   0.711669  0.695802
```

Or, depending on what you actually need:

```julia
Foo{N}() where {N} = @SMatrix rand(N, N)
```

would be much faster, but maybe this misses the point?


Thanks @DNF, we will go with `SOneTo` as you suggested. We have some intermediate computations between the `rand` and the slice, which is why we can't simply construct the matrix in one go with `@SMatrix rand(N,N)`.

Yes, this definitely looks possible. The only problem is that putting such a macro in StaticArrays would involve quite a lot of work handling potential edge cases.


No new release with these features yet, but I wanted to add that there’s now a statically sized Cholesky factorization for up to 24×24 matrices – previously, all we had was 3×3 and smaller.
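For example (a sketch; `cholesky` comes from `LinearAlgebra`, and the statically sized method for this size lives on the as-yet-unreleased master mentioned above):

```julia
using StaticArrays, LinearAlgebra

A = @SMatrix randn(8, 8)
S = Symmetric(A * A' + 8I)   # a statically sized SPD matrix
C = cholesky(S)              # with the new method: no heap allocation
```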


This is awesome work, kudos to you and everyone who has contributed. Similar to @oschulz pointing out ArraysOfArrays, perhaps it’s worth pointing out that Julia 1.6 will include a new `reinterpret(reshape, T, A)` that slightly modifies the API and also fixes most of the performance problems we’ve had with reinterpreted arrays. Here’s a quick demo:

```julia
julia> using StaticArrays

julia> A = rand(9, 128, 128);

julia> AA = reinterpret(reshape, SMatrix{3,3,Float64,9}, A);

julia> AA[2, 3]
3×3 SMatrix{3, 3, Float64, 9} with indices SOneTo(3)×SOneTo(3):
 0.0871237  0.107331   0.778863
 0.537128   0.502304   0.150022
 0.103887   0.0479331  0.198894

julia> A[:, 2, 3]
9-element Vector{Float64}:
 0.08712374721082883
 0.5371283245280383
 0.10388662539559235
 0.1073311564764261
 0.502303884665636
 0.04793312653099191
 0.7788630258282125
 0.1500218148871566
 0.1988941798718733

julia> AA[2, 3] = [1 2 3; 4 5 6; 7 8 9]
3×3 Matrix{Int64}:
 1  2  3
 4  5  6
 7  8  9

julia> AA[2, 3]
3×3 SMatrix{3, 3, Float64, 9} with indices SOneTo(3)×SOneTo(3):
 1.0  2.0  3.0
 4.0  5.0  6.0
 7.0  8.0  9.0

julia> A[:, 2, 3]
9-element Vector{Float64}:
 1.0
 4.0
 7.0
 2.0
 5.0
 8.0
 3.0
 6.0
 9.0
```

It's worth noting that only one dimension (the first) gets “consumed” by `reinterpret(reshape, ...)`, and that only when `sizeof(T)` is an integer multiple of `sizeof(eltype(A))`.


Oh, this is very nice, thanks for the heads-up, @tim.holy.