Recent developments in StaticArrays.jl land

StaticArrays.jl has received a few cool new features since its 0.12.4 release that are now fast thanks to the memory layout changes in Julia 1.5. These changes are commonly known as non-allocating views but as I show here it’s just the tip of the iceberg :slight_smile:.

Here is the outline to make this post a bit more digestible:

  1. A cool application.
  2. Sized abstract arrays.
  3. Faster matrix multiplication.
  4. Loading time reduction.

These things may not be easy to find so I decided to write about it here.

1. A cool application

Diffusion tensor MRI produces arrays of 3x3 symmetric positive definite (SPD) matrices. Analyzing such data is one possible application of Manifolds.jl . There are multiple operations that can be performed on such data, and one of the is the so-called logarithmic map in the linear-affine metric. It takes two NxN SPD matrices and returns a vector that describes how to get from the first one to the second one. Here is the code that performs this operation:

function log!(M::SymmetricPositiveDefinite{N}, X, p, q) where {N}
    e = eigen(Symmetric(p))
    U = e.vectors
    S = e.values
    Ssqrt = Diagonal(sqrt.(S))
    SsqrtInv = Diagonal(1 ./ sqrt.(S))
    pSqrt = Symmetric(U * Ssqrt * transpose(U))
    pSqrtInv = Symmetric(U * SsqrtInv * transpose(U))
    T = Symmetric(pSqrtInv * q * pSqrtInv)
    e2 = eigen(T)
    Se = Diagonal(log.(max.(e2.values, eps())))
    pUe = pSqrt * e2.vectors
    return mul!(X, pUe, Se * transpose(pUe))
end

Now we may want to compute that for each pair of corresponding matrices from the two arrays we get from DT-MRI. How many allocations would that take? Just one, for the result. Now, how is that possible with so much linear algebra going on? Let’s take a look.

First, we need to somehow tell StaticArrays that we are dealing with static 3x3 matrices, but a lot of them. Traditionally one would make, for example, a 128x128 array of 3x3 SArrays but that’s, from my experience, very inconvenient to work with. Instead, we can use HybridArrays.jl to make a 3x3x128x128 array with the first two axes statically sized and the other two dynamically sized. With that we can easily access the 3x3 parts and have just a normal SMatrix. That’s for reading from it but we also need to write the result somehow. That’s where sized abstract arrays come.

2. Sized abstract arrays

SizedArray was generalized to support adding size information to arbitrary AbstractArrays. For example you can make a sized view. How? Just call view! As long as the size can be figured out from types, the result will be a SubArray wrapped in SizedArray. And that won’t allocate on Julia 1.5+.

julia> using HybridArrays, StaticArrays, BenchmarkTools

julia> x = HybridArray{Tuple{3,3,StaticArrays.Dynamic(),StaticArrays.Dynamic()}}(randn(3,3,128,128));

julia> xv = view(x, :, :, 1, 2)
3×3 SizedArray{Tuple{3,3},Float64,2,2,SubArray{Float64,2,HybridArray{Tuple{3,3,StaticArrays.Dynamic(),StaticArrays.Dynamic()},Float64,4,4,Array{Float64,4}},Tuple{Base.Slice{SOneTo{3}},Base.Slice{SOneTo{3}},Int64,Int64},true}} with indices SOneTo(3)×SOneTo(3):
  0.453334  -0.202884  -0.977482
 -0.311551   0.324371  -0.422867
  1.24581    1.56688    0.0744785

julia> xv[3, 3] = 12.0
12.0

julia> x[3, 3, 1, 2]
12.0

You can make anything sized, I’ve even managed to make a GPU kernel in CUDA.jl broadcast with SizedArray views of parts of CuArrays.

3. Faster matrix multiplication

Sized views are, though, not all that was required to make that log! method non-allocating. The other part is multiplication with Symmetric matrices that allocates on StaticArrays 0.12.4 (it falls back to a generic LinearAlgebra implementation). StaticArrays master has that improved and now you can freely wrap your arrays in Symmetric, Hermitian, Adjoint, triangular matrix wrappers etc. StaticArrays 0.12.4 has special implementations only for Adjoint, Transpose and triangular matrices.

Thanks to @Elrod multiplication of statically sized matrices now also uses muladd that makes it even faster.

4. Loading time reduction

The excellent work on reducing invalidations significantly impacted StaticArrays but now it turns out that loading time is dominated by insertion of new methods to existing functions, a different problem. For example on my computer slightly restricting signature of a 5-argument mul! method decreased loading time by 0.35s. These restricted cases never even worked on StaticArray and no one filed an issue about it.

Depending on how much of this structured matrix multiplication is removed, I get loading times on a recent Julia nightly between 0.4s and 0.8s. However, in contrast with other improvements mentioned here which are already merged to master, there is no consensus yet what to do about it.

As always, there are also other improvements to existing methods and new methods being implemented. The next major release of StaticArrays.jl will be quite cool :slightly_smiling_face: .

52 Likes

Thank you @mateuszbaran for the nice updates!

I wonder if you could share some comments about the following behavior. Could it also be improved in future releases?

julia> A = @SMatrix rand(3,3)
3×3 SArray{Tuple{3,3},Float64,2,9} with indices SOneTo(3)×SOneTo(3):
 0.797002  0.00322277  0.67059
 0.430339  0.115212    0.525332
 0.674385  0.126444    0.417627

julia> A[1:2,1:2]
2×2 Array{Float64,2}:
 0.797002  0.00322277
 0.430339  0.115212

It seems that slicing static arrays breaks the static property of the array.

The non-allocating views in Julia v1.5 really can make a huge difference. :slight_smile:

Btw, you can also view flat arrays as arrays of static array using ArraysOfArrays.jl:

julia> using ArraysOfArrays, StaticArrays

julia> A = nestedview(rand(2, 100), SVector{2})

julia> A[42]
2-element SArray{Tuple{2},Float64,1,2} with indices SOneTo(2):
 0.5341239725077629
 0.24394153580924982

This is now allocation free, no data is copied (except to the stack).

4 Likes

Thanks!

There was some discussion about improving constant propagation of non-static indices: https://github.com/JuliaArrays/StaticArrays.jl/pull/703 . The brief summary (as I understand it) is that as you start thinking about more and more complex examples of such indexing it is clear that it has to break at some point and it’s not clear that pushing that point a bit further is a good idea. For now you can just use SOneTo: A[SOneTo(2), SOneTo(2)].

1 Like

I see that there is also StaticArrays.SUnitRange construct:

julia> A = @SMatrix rand(4, 5);

julia> v = StaticArrays.SUnitRange(2,3);

julia> A[v, v]
2×2 SArray{Tuple{2,2},Float64,2,4} with indices SOneTo(2)×SOneTo(2):
 0.802085  0.307546
 0.917425  0.419405

and it even works for ordinary arrays:

julia> B = rand(4, 5);

julia> B[v, v]
2×2 SArray{Tuple{2,2},Float64,2,4} with indices SOneTo(2)×SOneTo(2):
 0.205907  0.981022
 0.863682  0.717917

Could there be a ‘static index macro’ that re-wrote:

@SIndex B[2:3, 1:4]

into

julia> B[SUnitRange(2,3), SUnitRange(1,4)]

?

@mateuszbaran my use case is actually this one:

julia> struct Foo{N} end

julia> function Foo{N}() where N
         A = @SMatrix rand(10,10)
         return A[1:N,1:N]
       end

julia> Foo{2}()
2×2 Array{Float64,2}:
 0.849248  0.495188
 0.997704  0.58433

Do I need to play with Val{N} somewhere?

julia> function Foo{N}() where N
                A = @SMatrix rand(10,10)
                return A[SOneTo(N),SOneTo(N)]
              end

julia> Foo{3}()
3×3 SArray{Tuple{3,3},Float64,2,9} with indices SOneTo(3)×SOneTo(3):
 0.594725  0.849064  0.525443
 0.6584    0.313973  0.723844
 0.25742   0.711669  0.695802

Or, depending on what you actually need:

Foo{N}() where {N} = @SMatrix rand(N, N)

would be much faster, but maybe this misses the point?

3 Likes

Thanks @DNF, we will go with SOneTo as you suggested :ok_hand:

We have some intermediate computations between the rand and the slice, that is why we can’t simply construct in one go with @SMatrix rand(N,N).

Yes, this definitely looks possible. The only problem here is that putting such macro in StaticArrays would be quite a lot of work related to handling potential edge cases.

1 Like

No new release with these features yet, but I wanted to add that there’s now a statically sized Cholesky factorization for up to 24x24 matrices – previously, all we had was 3x3 and smaller.

16 Likes

This is awesome work, kudos to you and everyone who has contributed. :+1:

Similar to @oschulz pointing out ArraysOfArrays, perhaps it’s worth pointing out that Julia 1.6 will include a new reinterpret(reshape, T, A) that slightly modifies the API and also fixes most of the performance problems we’ve had with reinterpreted arrays. Here’s a quick demo:

julia> using StaticArrays

julia> A = rand(9, 128, 128);

julia> AA = reinterpret(reshape, SMatrix{3,3,Float64,9}, A);

julia> AA[2, 3]
3×3 SMatrix{3, 3, Float64, 9} with indices SOneTo(3)×SOneTo(3):
 0.0871237  0.107331   0.778863
 0.537128   0.502304   0.150022
 0.103887   0.0479331  0.198894

julia> A[:, 2, 3]
9-element Vector{Float64}:
 0.08712374721082883
 0.5371283245280383
 0.10388662539559235
 0.1073311564764261
 0.502303884665636
 0.04793312653099191
 0.7788630258282125
 0.1500218148871566
 0.1988941798718733

julia> AA[2, 3] = [1 2 3; 4 5 6; 7 8 9]
3×3 Matrix{Int64}:
 1  2  3
 4  5  6
 7  8  9

julia> AA[2, 3]
3×3 SMatrix{3, 3, Float64, 9} with indices SOneTo(3)×SOneTo(3):
 1.0  2.0  3.0
 4.0  5.0  6.0
 7.0  8.0  9.0

julia> A[:, 2, 3]
9-element Vector{Float64}:
 1.0
 4.0
 7.0
 2.0
 5.0
 8.0
 3.0
 6.0
 9.0

Worth noting that you only have one dimension (the first) that gets “consumed” by the reinterpret(reshape, ...) (and that only when sizeof(T) is an integer multiple of sizeof(eltype(A))).

8 Likes

Oh, this is very nice, thanks for the heads-up, @tim.holy.

To solve the issue, the array of indices must itself be a StaticArray.
In fact slicing a standard Array with StaticArrays returns a StaticArrays slice.