# Solve linear system repeatedly without allocation

I have a `B \ A` calculation I need to solve repeatedly (10^4β10^5 times) with

1. the same `B`,
2. ideally without allocation, that includes the result
3. `A` is a mutable buffer I want to write to
4. the elements of `A` may be `ForwardDiff.Dual`

I narrowed down a solution that I like (factorize, go static), but I would appreciate suggestions, so I am posting self-contained benchmarks. Julia 1.8, latest package versions.

Some very simple benchmarks that reflect the dimensions of the problem.

## Setup

``````using StaticArrays, BenchmarkTools, LinearAlgebra

n = 49
m = 6
T = Float64
B = rand(T, n, n);
A = rand(T, n, m);
B_lu = lu(B);
A_s = MMatrix{n,m}(A); # recall, I will construct A again and again
B_s = SMatrix{n,n}(B);
B_s_lu = lu(B_s);
# lu with overwritten A which ends up static
function f(::Val{M}, ::Val{N}, A, B) where {M,N}
ldiv!(B, A)
SMatrix{M,N}(A)
end
``````

## The benchmarks with `Float64`

``````julia> @benchmark \$B \ \$A # the naive \
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  22.032 ΞΌs β¦  3.351 ms  β GC (min β¦ max): 0.00% β¦ 90.74%
Time  (median):     24.108 ΞΌs              β GC (median):    0.00%
Time  (mean Β± Ο):   26.990 ΞΌs Β± 47.428 ΞΌs  β GC (mean Β± Ο):  2.24% Β±  1.28%

ββββββββββββββββββββββ βββββββββ ββ ββββββ                  β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
22 ΞΌs        Histogram: log(frequency) by time      46.2 ΞΌs <

Memory estimate: 21.73 KiB, allocs estimate: 4.

julia> @benchmark \$B_lu \ \$A # factorize
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min β¦ max):  4.294 ΞΌs β¦ 796.204 ΞΌs  β GC (min β¦ max): 0.00% β¦ 92.55%
Time  (median):     5.023 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   5.380 ΞΌs Β±   7.989 ΞΌs  β GC (mean Β± Ο):  1.37% Β±  0.93%

ββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
4.29 ΞΌs         Histogram: frequency by time        8.34 ΞΌs <

Memory estimate: 2.44 KiB, allocs estimate: 1.

julia> @benchmark \$B_s_lu \ \$A_s # static lu
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
Range (min β¦ max):  8.937 ΞΌs β¦ 90.601 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     9.157 ΞΌs              β GC (median):    0.00%
Time  (mean Β± Ο):   9.725 ΞΌs Β±  1.634 ΞΌs  β GC (mean Β± Ο):  0.00% Β± 0.00%

βββββββββββββββββββββ βββ βββ   β        ββββ              β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
8.94 ΞΌs      Histogram: log(frequency) by time     14.5 ΞΌs <

Memory estimate: 2.38 KiB, allocs estimate: 1.

julia> @benchmark f(\$Val(m), \$Val(n), C, B_lu) setup = begin C = copy(A) end # f
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min β¦ max):  5.372 ΞΌs β¦ 877.653 ΞΌs  β GC (min β¦ max): 0.00% β¦ 92.07%
Time  (median):     5.996 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   6.576 ΞΌs Β±  15.013 ΞΌs  β GC (mean Β± Ο):  3.68% Β±  1.60%

βββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
5.37 ΞΌs         Histogram: frequency by time        9.89 ΞΌs <

Memory estimate: 2.38 KiB, allocs estimate: 1.
``````

## The benchmarks with `ForwardDiff.Dual`

A bit of setup, then rerun the same code (incl setup above):

``````import ForwardDiff
_dual(x) = ForwardDiff.Dual(x, Tuple(randn(5)))
A = _dual.(A);
``````
``````julia> @benchmark \$B \ \$A # the naive \
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  115.994 ΞΌs β¦  3.557 ms  β GC (min β¦ max): 0.00% β¦ 84.93%
Time  (median):     128.175 ΞΌs              β GC (median):    0.00%
Time  (mean Β± Ο):   136.982 ΞΌs Β± 64.630 ΞΌs  β GC (mean Β± Ο):  1.25% Β±  2.80%

βββββββββββββββββββββββββββββ                                β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
116 ΞΌs        Histogram: log(frequency) by time       247 ΞΌs <

Memory estimate: 145.84 KiB, allocs estimate: 6.

julia> @benchmark \$B_lu \ \$A # factorize
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):   96.735 ΞΌs β¦  7.046 ms  β GC (min β¦ max): 0.00% β¦ 84.50%
Time  (median):     111.906 ΞΌs              β GC (median):    0.00%
Time  (mean Β± Ο):   121.081 ΞΌs Β± 90.169 ΞΌs  β GC (mean Β± Ο):  1.71% Β±  2.82%

βββββββββββββββββββββββββ                                   β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
96.7 ΞΌs       Histogram: log(frequency) by time       251 ΞΌs <

Memory estimate: 126.55 KiB, allocs estimate: 3.

julia> @benchmark \$B_s_lu \ \$A_s # static lu
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
Range (min β¦ max):  8.938 ΞΌs β¦ 183.102 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     9.117 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   9.752 ΞΌs Β±   2.930 ΞΌs  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββββββββββ βββββββββββ  ββ   β    β          ββ             β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
8.94 ΞΌs      Histogram: log(frequency) by time      14.1 ΞΌs <

Memory estimate: 2.38 KiB, allocs estimate: 1.

julia> @benchmark f(\$Val(m), \$Val(n), C, B_lu) setup = begin C = copy(A) end # f
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  27.163 ΞΌs β¦  2.311 ms  β GC (min β¦ max): 0.00% β¦ 89.31%
Time  (median):     28.529 ΞΌs              β GC (median):    0.00%
Time  (mean Β± Ο):   30.160 ΞΌs Β± 23.206 ΞΌs  β GC (mean Β± Ο):  0.68% Β±  0.89%

ββββββββββββββββ ββ  ββ  β        β  βββ                    β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
27.2 ΞΌs      Histogram: log(frequency) by time        46 ΞΌs <

Memory estimate: 13.88 KiB, allocs estimate: 1.
``````
1 Like

Canβt you horizontally stack all `A` and do a single solve?

Unfortunately not, it is an iterative problem (each iteration does this to calculate something, within a minimizer).

Check out FastLapackInterface

Edit: Sorry, I initially missed the part where the elements of `A` may be `ForwardDiff.Dual`. FastLapackInterface isnβt applicable there.

2 Likes

Maybe this helps: Non-allocating matrix inversion - #7 by stevengj

Your 49x49 matrix is probably too big for `StaticArrays` to be effective. In general, `StaticArrays` loses most of its performance benefit (and can start to choke your compiler) as vectors/matrices start to grow above 100ish elements.

I think everything you want is covered by `ldiv!(B_lu,A)`. Itβs allocation-free and works on `ForwardDiff.Dual` matrices without modification. Note that this overwrites `A` with the result. If you want nondestructive solutions, youβll want to copy `A` to a buffer (preallocated like `Abuf = similar(A)`, then `copy!(Abuf,A)` for every call) and then solve in the buffer.

4 Likes

The benchmarks suggest otherwise. That said, I also find this puzzling since the relevant method (`StaticArrays._A_ldiv_B`) does not shortcut for βlargeβ matrices, it is still a generated function which is essentially a tape of operations. Maybe it still fits in the CPU cacheβ¦

I donβt have a solution, but if I understood correctly you could try to use the sweep operator instead of the \ which is available in two packages:

If you donβt need to compute any extra statistics (like the sum of squares). The package from Joshday will likely be faster. I donβt think there is a need to do some allocation.
I didnβt check how different the results for the QC decomposition are from the sweep operator, and I suspect it is mainly related to the characteristic of your data (B).
Hope it helps,

If it is possible to work with `A'` instead of A, then you could just reinterpret the dual numbers as scalars to get an `A` matrix with more columns. Then apply `B_lu` and reinterpret back.

1 Like

Here are my benchmarks, using your definitions from above:

``````A_ss = SMatrix{n,m}(A) # no need to use MMatrix here
Abuf = similar(A);
@btime \(\$B_lu,\$A); # 3.6us, 1 alloc # allocates new result
@btime ldiv!(\$B_lu,\$A); # 2.8us, 0 alloc # overwrites A with result
@btime ldiv!(\$B_lu,copy!(\$Abuf,\$A)); # 3.1us, 0 alloc # result saved to Abuf
@btime \(\$B_s_lu,\$A_s); # 7.0us, 1 alloc # allocates new result
@btime \(\$B_s_lu,\$A_ss); # 7.0us, 0 alloc # "allocates" new result
``````

So for me, `StaticArrays` are 2x slower on this operation. Also, my terminal spins for 5-10s the first time I run a `StaticArrays` version while it compiles the unrolled version. Note that even the allocating `Matrix` version of this solve outperforms the `StaticArrays` versions.

There is no benefit to using a `MArray` over a `SArray` except for the ability to explicitly mutate it. The penalty is that operations on `MArray` are ocassionally slower, since they can be tougher for the compiler to reason about. An `SArray` is immutable and lives on the stack, so rarely results in allocations. `SArray` should be preferred over `MArray` where possible, which is most places.

1 Like

Just to make this thread more searchable for other future users (This question arises from time to time), it is better to add Solve linear system to the title. @Tamas_Papp , Could you do that?

1 Like

I think is a good suggestion, done.

1 Like