Solve linear system repeatedly without allocation

I have a B \ A calculation I need to solve repeatedly (10^4–10^5 times) with

  1. the same B,
  2. ideally without allocation, that includes the result
  3. A is a mutable buffer I want to write to
  4. the elements of A may be ForwardDiff.Dual

I narrowed down a solution that I like (factorize, go static), but I would appreciate suggestions, so I am posting self-contained benchmarks. Julia 1.8, latest package versions.

Some very simple benchmarks that reflect the dimensions of the problem.

Setup

using StaticArrays, BenchmarkTools, LinearAlgebra

n = 49
m = 6
T = Float64
B = rand(T, n, n);
A = rand(T, n, m);
B_lu = lu(B);
A_s = MMatrix{n,m}(A); # recall, I will construct A again and again
B_s = SMatrix{n,n}(B);
B_s_lu = lu(B_s);
# lu with overwritten A which ends up static
function f(::Val{M}, ::Val{N}, A, B) where {M,N}
    ldiv!(B, A)
    SMatrix{M,N}(A)
end

The benchmarks with Float64

julia> @benchmark $B \ $A # the naive \
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  22.032 ΞΌs …  3.351 ms  β”Š GC (min … max): 0.00% … 90.74%
 Time  (median):     24.108 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   26.990 ΞΌs Β± 47.428 ΞΌs  β”Š GC (mean Β± Οƒ):  2.24% Β±  1.28%

  β–‚β–†β–ˆβ–ˆβ–ˆβ–‡β–†β–…β–…β–„β–ƒβ–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–β–β–β– ▁▂▃▃▂▁▁▁▁ ▁▁ ▁▁▁▁▁▁                  β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–†β–†β–†β–…β–…β–‡β–†β–…β–…β–…β–†β–‡ β–ˆ
  22 ΞΌs        Histogram: log(frequency) by time      46.2 ΞΌs <

 Memory estimate: 21.73 KiB, allocs estimate: 4.

julia> @benchmark $B_lu \ $A # factorize
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.294 ΞΌs … 796.204 ΞΌs  β”Š GC (min … max): 0.00% … 92.55%
 Time  (median):     5.023 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   5.380 ΞΌs Β±   7.989 ΞΌs  β”Š GC (mean Β± Οƒ):  1.37% Β±  0.93%

       β–ƒβ–†β–ˆβ–ˆβ–†β–‡β–„β–‚                                                
  β–β–‚β–ƒβ–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  4.29 ΞΌs         Histogram: frequency by time        8.34 ΞΌs <

 Memory estimate: 2.44 KiB, allocs estimate: 1.

julia> @benchmark $B_s_lu \ $A_s # static lu
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.937 ΞΌs … 90.601 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     9.157 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.725 ΞΌs Β±  1.634 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‚β–ˆβ–ˆβ–…β–ƒβ–‚β–β–‚β–‚β–β–β–ƒβ–‚β–β–ƒβ–‚β–β–β–‚β–β– ▁▂▁ ▁▁▁   ▁        ▁▂▂▁              β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–…β–†β–‡β–ˆβ–ˆβ–‡β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–β–…β–†β–†β–„β–„β–ƒβ–…β–„β–„ β–ˆ
  8.94 ΞΌs      Histogram: log(frequency) by time     14.5 ΞΌs <

 Memory estimate: 2.38 KiB, allocs estimate: 1.

julia> @benchmark f($Val(m), $Val(n), C, B_lu) setup = begin C = copy(A) end # f
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.372 ΞΌs … 877.653 ΞΌs  β”Š GC (min … max): 0.00% … 92.07%
 Time  (median):     5.996 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   6.576 ΞΌs Β±  15.013 ΞΌs  β”Š GC (mean Β± Οƒ):  3.68% Β±  1.60%

    β–‚β–‚β–ˆβ–ˆβ–ˆβ–†β–…β–ƒβ–                                                  
  β–β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  5.37 ΞΌs         Histogram: frequency by time        9.89 ΞΌs <

 Memory estimate: 2.38 KiB, allocs estimate: 1.

The benchmarks with ForwardDiff.Dual

A bit of setup, then rerun the same code (incl setup above):

import ForwardDiff
_dual(x) = ForwardDiff.Dual(x, Tuple(randn(5)))
A = _dual.(A);
julia> @benchmark $B \ $A # the naive \
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  115.994 ΞΌs …  3.557 ms  β”Š GC (min … max): 0.00% … 84.93%
 Time  (median):     128.175 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   136.982 ΞΌs Β± 64.630 ΞΌs  β”Š GC (mean Β± Οƒ):  1.25% Β±  2.80%

  β–ƒβ–ˆβ–†β–„β–…β–†β–†β–„β–ƒβ–‚β–ƒβ–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–‚β–β–β–‚β–β–β–β–β–β–β–β–                                β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–‡β–‡β–†β–‡β–†β–†β–…β–…β–…β–†β–…β–†β–†β–…β–…β–…β–…β–†β–„β–…β–„β–…β–„β–…β–ƒβ–„ β–ˆ
  116 ΞΌs        Histogram: log(frequency) by time       247 ΞΌs <

 Memory estimate: 145.84 KiB, allocs estimate: 6.

julia> @benchmark $B_lu \ $A # factorize
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   96.735 ΞΌs …  7.046 ms  β”Š GC (min … max): 0.00% … 84.50%
 Time  (median):     111.906 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   121.081 ΞΌs Β± 90.169 ΞΌs  β”Š GC (mean Β± Οƒ):  1.71% Β±  2.82%

   β–…β–†β–„β–ƒβ–ˆβ–‡β–„β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–β–‚β–‚β–β–β–β–β–β–β–ƒβ–ƒ                                   β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–‡β–†β–†β–…β–…β–…β–†β–„β–…β–…β–†β–†β–„β–…β–†β–„β–…β–ƒβ–…β–„β–β–„β–„β–…β–…β–„β–…β–„β–† β–ˆ
  96.7 ΞΌs       Histogram: log(frequency) by time       251 ΞΌs <

 Memory estimate: 126.55 KiB, allocs estimate: 3.

julia> @benchmark $B_s_lu \ $A_s # static lu
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.938 ΞΌs … 183.102 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     9.117 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.752 ΞΌs Β±   2.930 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–…β–ˆβ–‡β–„β–‚β–‚β–β–β–β– ▂▃▂▁▂▂▁▁▂▂▁  ▁▁   ▁    ▁          ▂▁             β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–†β–ˆβ–ˆβ–‡β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–…β–…β–†β–…β–…β–β–… β–ˆ
  8.94 ΞΌs      Histogram: log(frequency) by time      14.1 ΞΌs <

 Memory estimate: 2.38 KiB, allocs estimate: 1.

julia> @benchmark f($Val(m), $Val(n), C, B_lu) setup = begin C = copy(A) end # f
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  27.163 ΞΌs …  2.311 ms  β”Š GC (min … max): 0.00% … 89.31%
 Time  (median):     28.529 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   30.160 ΞΌs Β± 23.206 ΞΌs  β”Š GC (mean Β± Οƒ):  0.68% Β±  0.89%

  β–†β–ˆβ–†β–‡β–…β–…β–‡β–„β–ƒβ–„β–ƒβ–ƒβ–„β–‚β–‚β–‚ ▁▁  ▂▁  ▁        ▁  ▁▁▁                    β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–†β–…β–†β–†β–†β–†β–‡β–†β–…β–„β–…β–„β–…β–„β–… β–ˆ
  27.2 ΞΌs      Histogram: log(frequency) by time        46 ΞΌs <

 Memory estimate: 13.88 KiB, allocs estimate: 1.
1 Like

Can’t you horizontally stack all A and do a single solve?

Unfortunately not, it is an iterative problem (each iteration does this to calculate something, within a minimizer).

Check out FastLapackInterface

Edit: Sorry, I initially missed the part where the elements of A may be ForwardDiff.Dual. FastLapackInterface isn’t applicable there.

2 Likes

Maybe this helps: Non-allocating matrix inversion - #7 by stevengj

Your 49x49 matrix is probably too big for StaticArrays to be effective. In general, StaticArrays loses most of its performance benefit (and can start to choke your compiler) as vectors/matrices start to grow above 100ish elements.

I think everything you want is covered by ldiv!(B_lu,A). It’s allocation-free and works on ForwardDiff.Dual matrices without modification. Note that this overwrites A with the result. If you want nondestructive solutions, you’ll want to copy A to a buffer (preallocated like Abuf = similar(A), then copy!(Abuf,A) for every call) and then solve in the buffer.

4 Likes

The benchmarks suggest otherwise. That said, I also find this puzzling since the relevant method (StaticArrays._A_ldiv_B) does not shortcut for β€œlarge” matrices, it is still a generated function which is essentially a tape of operations. Maybe it still fits in the CPU cache…

I don’t have a solution, but if I understood correctly you could try to use the sweep operator instead of the \ which is available in two packages:

If you don’t need to compute any extra statistics (like the sum of squares). The package from Joshday will likely be faster. I don’t think there is a need to do some allocation.
I didn’t check how different the results for the QC decomposition are from the sweep operator, and I suspect it is mainly related to the characteristic of your data (B).
Hope it helps,

If it is possible to work with A' instead of A, then you could just reinterpret the dual numbers as scalars to get an A matrix with more columns. Then apply B_lu and reinterpret back.

1 Like

Here are my benchmarks, using your definitions from above:

A_ss = SMatrix{n,m}(A) # no need to use MMatrix here
Abuf = similar(A);
@btime \($B_lu,$A); # 3.6us, 1 alloc # allocates new result
@btime ldiv!($B_lu,$A); # 2.8us, 0 alloc # overwrites A with result
@btime ldiv!($B_lu,copy!($Abuf,$A)); # 3.1us, 0 alloc # result saved to Abuf
@btime \($B_s_lu,$A_s); # 7.0us, 1 alloc # allocates new result
@btime \($B_s_lu,$A_ss); # 7.0us, 0 alloc # "allocates" new result

So for me, StaticArrays are 2x slower on this operation. Also, my terminal spins for 5-10s the first time I run a StaticArrays version while it compiles the unrolled version. Note that even the allocating Matrix version of this solve outperforms the StaticArrays versions.

There is no benefit to using a MArray over a SArray except for the ability to explicitly mutate it. The penalty is that operations on MArray are ocassionally slower, since they can be tougher for the compiler to reason about. An SArray is immutable and lives on the stack, so rarely results in allocations. SArray should be preferred over MArray where possible, which is most places.

1 Like

Just to make this thread more searchable for other future users (This question arises from time to time), it is better to add Solve linear system to the title. @Tamas_Papp , Could you do that?

1 Like

I think is a good suggestion, done. :slight_smile:

1 Like