Package to avoid allocations in some functions when using StaticArrays.jl

Ronis_BR · May 22, 2024, 12:48am

Hi!

As I mentioned before, I am trying to build a full satellite attitude control subsystem (ACS) using Julia, which maybe one day will hopefully fly in a CubeSat. One very important requirement is to avoid runtime allocations.

The entire system is built using StaticArrays from StaticArrays.jl and I manage to create a mockup of an entire ACS without allocations except for one single case: the Kalman filters.

We need to use at least two Kalman filters in our project: one for the attitude and one for the orbit. Those Kalman filters need to invert small matrices (6 x 6 and 3 x 3). The function inv works fine and provide 0 allocation when using SMatrix. However, for embedded systems that must work continuously, it is much safe to use the pseudo-inverse pinv because the numerical errors can lead to singular matrices (we are using Float32), especially when the KF has converged and the covariance is small.

The problem is that StaticArrays.jl just converts the StaticMatrix to Matrix and call the Base function svd, which allocates, to compute the singular value decomposition required to obtain the pseudo-inverse. I opened a ticket here:

github.com/JuliaArrays/StaticArrays.jl

svd allocates even with small matrices

opened 10:21PM - 16 May 24 UTC

ronisbr

Hi! When computing the SVD of a static matrix, we see some allocations: ``…`julia julia> using StaticArrays, BenchmarkTools julia> S = @SMatrix rand(3, 3); julia> @btime StaticArrays.svd($S) 1.333 μs (8 allocations: 2.48 KiB) ``` which is probably caused because the algorithm just converts the input to `Matrix` and call the `svd` in `Base`. The problem here is that `svd` is used to compute `pinv` that is somewhat very used when inverting matrices for embedded Kalman filters. I am pursing to embed a full satellite attitude control algorithm written entirely in Julia. However, I must avoid all allocations. In this specific case, I have not yet found a workaround to avoid allocations. Can anyone help me?

To solve this problem, my first attempt was to rewrite BLAS’s svd in Julia, which turns out to be an Herculean task. The second attempt was to use a naïve algorithm, which is slow but given the lack of allocations it should be faster than the current approach. No dice, the code I wrote was slower than the current version for any matrix bigger than 2 x 2.

Finally, I had an insight! I saw sometime ago that Julia’s compiler got smarter enough to avoid allocations inside a function if you use MMatrixs that do not exit the scope and then convert the result back to SMatrix. Hence, I thought: “why don’t we convert the static matrix to the mutable version, call the BLAS function using ccall, and then convert the result back to SMatrix?” If everything is type stable, it should avoid all allocations!

I mentioned this in that thread, but the developers think this approach should not be used inside StaticArrays.jl, at least for now. Hence, I create a new package called StaticArraysBlasInterface.jl, which is not registered yet and can be seen here:

Basically, it overloads functions in StaticArrays.jl with some supported types that leads to a allocation-free computation.

Currently, I only overloaded the svd function, which also provided an allocation-free pinv:

julia> using LinearAlgebra, StaticArrays, BenchmarkTools

julia> A = @SMatrix randn(6, 6);

julia> @btime pinv($A)
  3.542 μs (8 allocations: 5.30 KiB)
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
 -34.7034   -14.9662   -58.5796    57.7484    68.2193   120.044
 -21.7112    -9.89541  -36.0827    36.773     42.9534    76.1427
  -8.17912   -3.30241  -14.1337    13.6379    16.8462    28.4846
  87.5626    38.9288   144.489   -147.519   -170.758   -302.945
 -52.6933   -23.7337   -87.4295    89.1886   103.221    182.963
  17.2491     7.57408   28.6682   -28.5039   -33.6592   -59.0681

julia> using StaticArraysBlasInterface

julia> @btime pinv($A)
  2.893 μs (0 allocations: 0 bytes)
6×6 SMatrix{6, 6, Float64, 36} with indices SOneTo(6)×SOneTo(6):
 -34.7034   -14.9662   -58.5796    57.7484    68.2193   120.044
 -21.7112    -9.89541  -36.0827    36.773     42.9534    76.1427
  -8.17912   -3.30241  -14.1337    13.6379    16.8462    28.4846
  87.5626    38.9288   144.489   -147.519   -170.758   -302.945
 -52.6933   -23.7337   -87.4295    89.1886   103.221    182.963
  17.2491     7.57408   28.6682   -28.5039   -33.6592   -59.0681

There is some benchmarks here:

github.com

ronisbr/StaticArraysBlasInterfaces.jl/blob/main/benchmarks.md

# Benchmarks of StaticArraysBlasInterface.jl

## Singular Value Decomposition (Thin) (`Float64`)

| **Matrix Dimension** | **Time Before [ms]** | **Allocations Before [kB]** | **Time After [ms]** | **Allocations After [kB]** | **Gain [%]** |
|---------------------:|---------------------:|----------------------------:|--------------------:|---------------------------:|-------------:|
| 2 × 2                | 0.919057             | 1.76562                     | 0.439394            | 0.0                        | 52.2         |
| 3 × 3                | 1.4625               | 2.48438                     | 0.983333            | 0.0                        | 32.8         |
| 4 × 4                | 2.11567              | 3.48438                     | 1.5708              | 0.0                        | 25.8         |
| 5 × 5                | 3.02075              | 4.29688                     | 2.47222             | 0.0                        | 18.2         |
| 6 × 6                | 4.33929              | 5.29688                     | 3.71875             | 0.0                        | 14.3         |
| 7 × 7                | 4.82143              | 6.20312                     | 4.131               | 0.0                        | 14.3         |
| 8 × 8                | 6.45                 | 7.25                        | 5.90967             | 0.0                        | 8.4          |
| 9 × 9                | 8.18033              | 8.59375                     | 7.5                 | 0.0                        | 8.3          |
| 10 × 10              | 9.042                | 9.79688                     | 8.04167             | 0.0                        | 11.1         |

## Singular Value Decomposition (Thin) (`Float32`)

| **Matrix Dimension** | **Time Before [ms]** | **Allocations Before [kB]** | **Time After [ms]** | **Allocations After [kB]** | **Gain [%]** |
|---------------------:|---------------------:|----------------------------:|--------------------:|---------------------------:|-------------:|

This file has been truncated. show original

The results show a significant gain for small matrices (my use case) and an allocation-free algorithm that fits my requirements for the embedded software development. Notice that even if this algorithm was 2x slower, I would still need it because “no allocations” is a very hard requirement

If possible, I would like some feedback if this is a good approach, if it should be in a separate package or committed to StatciArrays.jl, and if there are more functions you want to see here.

Thanks!

mikmoore · May 22, 2024, 2:45pm

It’s not necessarily exactly piracy (or maybe it is, I haven’t looked closely), but your package changes the way that LinearAlgebra.svd and StaticArrays.SMatrix interact whenever it’s loaded. I don’t love that (or, if it’s piracy, I hate it), since it can affect other aprts of a user’s calculations too. It seems that this really would do better as a direct PR to StaticArrays or you should at least consider leaving your package unregistered. But I acknowledge that PRs tend to be slow.

Technical sidebar:

If numerical accuracy is any concern, you should strongly consider a “square root form” Kalman filter. This filter performs all of the Kalman filter equations on a factorized version of the covariance (often its Cholesky factorization). Numerically, the root form gives your roughly double the number of bits of precision because in P = L L^\top, the condition number of P is the square of that of L. In other words, a root form KF in Float32 has similar accuracy to a standard form KF in Float64. The implementation is more technical but ultimately similar in computational cost.

I haven’t looked into it deeply, but this paper looks to have a nice presentation of an implementation. Although the fact that you complain of singularities (rather than infinities) in your calculations makes me think you might be using an information form KF (rather than the more traditional moment form). It’s also possible to form the sqrt KF equations in information form.

You’d probably still have to massage the factor matrix (or the back-solves with it), to mitigate potential singularities (like what the pinv was accomplishing before, but probably slightly more challenging now). But the extra precision offered by the root form should make numerical issues much less common to begin with (prevention > cure).

Ronis_BR · May 22, 2024, 4:36pm

Hi @mikmoore !

It is somewhat type-piracy since I am overloading the behavior of the function StaticArrays._svd when the input is a StaticMatrix. However, I consider to be a “light” type-piracy because, at the end, there is no effect on user calculations. Notice that I am using exactly the same BLAS function with the same input parameters (except from the work array size) as the current implementation uses. The difference is that instead converting to a Matrix, I am using MMatrix to call the BLAS function.

Hence, I think the correct way would be to register this package so that the users can test it. If nothing breaks, I can consider submitting it to StaticArrays.jl if devs want.

Regarding the QR Kalman Filter, we tested it. However, the requirement to compute QR decompositions twice per step, leads to a substantially slower algorithm. This kind of application would only be useful if you have a very low number of bits (like Float16).

In current applications, you can scale the variable to avoid singularities. In fact, I haven’t found a single case where the KFs I implemented in the embedded software returned a singular matrix. However, since this is a critical system, I need to use pinv just in case.

mikmoore · May 22, 2024, 5:46pm

To me, it sounds like you should just make the PR rather than register the package. It doesn’t sound like there’s a downside to your approach (and, if there is one, it can be avoided by only using MArray below a certain size). A PR avoids any risk of piracy or strangeness. Even “no-op” overloads can cause real trouble if they accidentally cause any deviation from unmodified results. Don’t forget a hypothetical future where StaticArrays gets its own SVD that conflicts with this and does produce a non-identical result.

You can solicit people to test this as a package even if it isn’t registered. I don’t expect registration to significantly increase the number of people that find or test a small package like this (since it is merely a performance improvement to a specific usage).

A package that is registered and then deprecated a year later will still remain in the registry indefinitely. Yes, packages are abandoned all the time (and at least this isn’t sitting on some valuable naming real estate), but it doesn’t seem like it should be the initial plan.

I’m surprised to hear that computing QR twice is slower than SVD (especially compounding that with StaticArrays vs BLAS at small sizes!). My computer does the SMatrix{6,6} SVD in about 4us (and your package appears to improve this somewhat by avoiding allocations, but not more than 25%) versus QR in 80ns. But it sounds like you’ve considered this so I’ll rest that case.

Since you say this handling is needed rarely and is mostly to avoid crashes, you can also do simpler things like attempt to factor the matrix or solve the system and, if singularity issues occur, only try to resolve those then.

What I’ve done in the past is attempt to run the filter and, if there’s a singularity issue, add a scaled identity matrix of sufficient size to restore conditioning and then re-try. For example, X + tr(X) * S * I will have the same eigenvectors as X and a condition number no worse than 1 + 1/S (because eigmax(X) <= tr(X) for any PSD X), so S=100*eps(eltype(x)) might be a reasonable choice. It’s a larger perturbation than you can get with svd/pinv, but can still be fairly small.

Ronis_BR · May 22, 2024, 5:57pm

The problem is that this approach was suggested to me in that thread in StaticArrays.jl repository. Hence, I think I am left without options here.

The inverse in the Kalman filter happens with a matrix the same size of your measurement vector, whereas the QR decomposition in that filter happens in matrices with the dimension of your measurement vector and state vector. In my case, the state vector has 15 variables whereas the measurement vector has only 3 (or 6) variables. That’s why I obtained that difference when computing the QR decomposition. Notice that even in the QR version, you still need to invert a matrix.

How did you catch the problem? try catch? Determinant? This approach seems interesting if the verification is simple.

Thanks for the help!

mikmoore · May 22, 2024, 6:48pm

I see that it was suggested there, but if this looks like a slam dunk (it does to me) then I’d suggest you just open the PR anyway. It will either be accepted or you’ll get some concrete criticism that you can use to improve it. If you run into unresolvable disagreements, then consider redirecting the code to a package.

How are you performing this inverse in your code? I’m assuming the inverse in question is (nominally) a positive semidefinite matrix. I guess right now you’re using something like A * pinv(B), but presumably you’d ideally do this llke A / B. Notice that this is always done by factorizing B (e.g., is implemented like A / factorize(B) – see the documentation for factorize). There’s no loss (and maybe a tiny gain) to manually specifying what factorization to use. And interacting with the factorization lets you examine the details to see how bad the situation might be.

You could always try/catch a SingularException and make the correction in catch. Alternatively, many factorizations have a check keyword you can use to disable error throwing. You can then check issuccess on the resulting factorization to see whether it’s valid and make the correction if it’s not. Something like

using LinearAlgebra
# also consider whether you might want a pivoted factorization
F = cholesky(Hermitian(B); check=false)
if !issuccess(F) # check for factorization failure
	B1 = B + 100*eps(tr(B)) * I
	# let it throw this time, since the problem must be much deeper if it does
	F = cholesky(Hermitian(B1))
end
X = A / F # A/B if possible or or A/B1 if not

Since you say this virtually never triggers, there is almost no performance cost to this. If the issue is purely numerical rank, you could alternatively just mangle the factorization manually to make it usable, but you must carefully consider how this could change your system.

ufechner7 · May 22, 2024, 6:55pm

Does try/ catch allocate? Then it cannot be used with StaticCompiler.jl.

stevengj · May 22, 2024, 7:16pm

Have you considered a Tikhonov-regularized Kalman filter? There seem to be several published variants of this notion, but my guess (caveat: without reading the papers) is that they are all conceptually centered on replacing a truncated pseudo-inverse A^+ (where you drop singular values smaller than \sigma_0, i.e. “truncate” the SVD) with a Tikhonov regularization (A^T A + \sigma_0^2 I)^{-1} A^T.

Alternatively, you can compute an analogue of your pseudo-inverse with a “truncated QR” (column-pivoted QR where you simply drop columns where the diagonal elements of R become too small). Then you are replacing one SVD with one QR factorization, which should definitely be faster (even if it is theoretically less reliable than SVD).

(Note that in any of these cases, you may not need to compute any matrix inverse explicitly — my recollection is that you just need to multiply it by another matrix in the Kalman algorithm, which can usually be done more quickly than explicitly forming the inverse and then multiplying it.)

Palli · May 22, 2024, 7:34pm

The approach in Zig language is that there is no global malloc, you must provide an allocator, or point to the default suggested one.

Maybe some functions in Julia should allow to plug in an allocator, instead of using similar (which is basically code for “malloc”). Bumper.jl already does it for you, but I only think for allocations within your loop. You want to do similar, and it will free memory for you (so you don’t need to worry about heap allocations, they feel like stack allocated), and if you can’t call an allocating function, then you can copy-paste the svd code (or inline it?) and make the allocator substitute (and free) work.

If you only need svd and can’t do it with Julia, or the approach above, then maybe call other language, C/C++, Rust or Zig that has it? Or even just Python (at least as a test, Python is reference counted, and can allocate and free, but unclear to me if that happens, and Julia’s GC takes over disabling RC/free)?

Ronis_BR · May 22, 2024, 8:23pm

There is other requirement that the embedded software must not use try/catch unfortunately.

Thanks for the suggestion! I did not consider this approach but I will definitely take a look at this implementation.

Thanks for the suggestion! Something much more simple that will make the development much easier would be if Julia has some directive to throw an error if something inside a function allocates. Hence, I can code the embedded software, run in a closed-loop, and check. If we had an error, I will know exactly where the allocation is happening. Eigen has something similar but it catches allocations at compile time.

simsurace · May 22, 2024, 9:15pm

If the PR into StaticArrays is not accepted and it has to live in a separate package, and if you still need/want to register it, I would suggest to change the name of the function such that there is no piracy. There does not appear to be any downside to doing so.

Zentrik · May 22, 2024, 9:59pm

You might want to check out GitHub - JuliaLang/AllocCheck.jl: AllocCheck.

Ronis_BR · May 24, 2024, 1:48am

Unfortunately, it will not work as I expect! If I change the function, pre-existing packages that uses pinv must be adapted to use the new function. In the current state, I just need to load the package to have access to the new version. It is something like what we have when loading AppleAccelerate.jl, you load the package and the supported functions get a performance improvement.

simsurace · May 24, 2024, 1:36pm

I understand the downsides, but if the fix is not upstreamed I find that (opt-in by having to explicitly call a different function) vastly preferable to the piracy situation that would otherwise exist.

Ronis_BR · May 24, 2024, 9:17pm

Maybe I was not clear enough. The reason to create this package is to avoid allocations caused by pinv (and other functions that I will implement shortly), for example, inside the Kalman filer. However, there are other Julia packages that use pinv internally and we will use inside the embedded software (see, for example, functions in the SatelliteToolbox.jl ecosystem). Hence, your approach is unfeasible because I will need to modify many packages to force them to use the new version.

Bruno_Amorim · May 25, 2024, 10:22am

A possible compromise regarding type-piracy is to make it optional: exports a functions with a different name and provide an option to commit type-piracy.

This is what I have done in ThreadedSparseCSR.jl: [ANN] Announcing ThreadedSparseCSR.jl

By default the package exports functions tmul and bmul which provide an aternative to mul and *. If one wishes to do so, the package provides the option to do type-piracy by overwriting mul and * by tmul or bmul

Ronis_BR · May 25, 2024, 2:13pm

This approach seems very good! Thanks for sharing, I will implement it.

ufechner7 · May 25, 2024, 2:40pm

But did you try to get your code merged into StaticArrays.jl ? If yes, where can I find the pull request?

Ronis_BR · May 25, 2024, 2:54pm

In the thread I posted, one of the StaticArrays.jl developers said that an external package was the best thing to do now.

Topic		Replies	Views
Non-allocating matrix inversion Performance matrices	13	3565	April 23, 2022
Linear algebra performance issue General Usage	20	1091	July 10, 2020
Computing Inverse of a stack of matrices General Usage performance , linearalgebra	19	3599	August 29, 2020
Help with allocations (variable length least squares) New to Julia memory-allocation	10	1157	March 25, 2021
Solving linear system without allocating Numerics	14	292	June 29, 2025

Package to avoid allocations in some functions when using StaticArrays.jl

Related topics