Solve linear system repeatedly without allocation

Tamas_Papp · July 7, 2022, 2:06pm

I have a B \ A calculation I need to solve repeatedly (10^4–10^5 times), where

- B is the same every time,
- ideally nothing allocates, and that includes the result,
- A is a mutable buffer I want to write to,
- the elements of A may be ForwardDiff.Dual.

I have narrowed it down to a solution that I like (factorize, then go static), but I would appreciate suggestions, so I am posting self-contained benchmarks. Julia 1.8, latest package versions.

Some very simple benchmarks that reflect the dimensions of the problem.

Setup

using StaticArrays, BenchmarkTools, LinearAlgebra

n = 49
m = 6
T = Float64
B = rand(T, n, n);
A = rand(T, n, m);
B_lu = lu(B);
A_s = MMatrix{n,m}(A); # recall, I will construct A again and again
B_s = SMatrix{n,n}(B);
B_s_lu = lu(B_s);
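# lu with overwritten A which ends up static
# note: the call sites pass Val(m) first and Val(n) second, so M is the
# number of columns and N the number of rows of A
function f(::Val{M}, ::Val{N}, A, B) where {M,N}
    ldiv!(B, A)      # B is the LU factorization; A is overwritten with the solution
    SMatrix{N,M}(A)  # A is N×M, so the static result must be N×M as well
end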

The benchmarks with `Float64`

julia> @benchmark $B \ $A # the naive \
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  22.032 μs …  3.351 ms  ┊ GC (min … max): 0.00% … 90.74%
 Time  (median):     24.108 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.990 μs ± 47.428 μs  ┊ GC (mean ± σ):  2.24% ±  1.28%

  ▂▆███▇▆▅▅▄▃▂▃▂▂▂▂▂▁▁▁▁ ▁▂▃▃▂▁▁▁▁ ▁▁ ▁▁▁▁▁▁                  ▂
  ████████████████████████████████████████████▇█▇▆▆▆▅▅▇▆▅▅▅▆▇ █
  22 μs        Histogram: log(frequency) by time      46.2 μs <

 Memory estimate: 21.73 KiB, allocs estimate: 4.

julia> @benchmark $B_lu \ $A # factorize
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.294 μs … 796.204 μs  ┊ GC (min … max): 0.00% … 92.55%
 Time  (median):     5.023 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.380 μs ±   7.989 μs  ┊ GC (mean ± σ):  1.37% ±  0.93%

       ▃▆██▆▇▄▂                                                
  ▁▂▃▅▇█████████▇▆▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▃▂▃▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  4.29 μs         Histogram: frequency by time        8.34 μs <

 Memory estimate: 2.44 KiB, allocs estimate: 1.

julia> @benchmark $B_s_lu \ $A_s # static lu
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.937 μs … 90.601 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.157 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.725 μs ±  1.634 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂██▅▃▂▁▂▂▁▁▃▂▁▃▂▁▁▂▁▁ ▁▂▁ ▁▁▁   ▁        ▁▂▂▁              ▂
  ██████████████████████████████▇███▅▆▇██▇▆█████▆▆▁▅▆▆▄▄▃▅▄▄ █
  8.94 μs      Histogram: log(frequency) by time     14.5 μs <

 Memory estimate: 2.38 KiB, allocs estimate: 1.

julia> @benchmark f($Val(m), $Val(n), C, B_lu) setup = begin C = copy(A) end # f
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.372 μs … 877.653 μs  ┊ GC (min … max): 0.00% … 92.07%
 Time  (median):     5.996 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.576 μs ±  15.013 μs  ┊ GC (mean ± σ):  3.68% ±  1.60%

    ▂▂███▆▅▃▁                                                  
  ▁▄██████████▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  5.37 μs         Histogram: frequency by time        9.89 μs <

 Memory estimate: 2.38 KiB, allocs estimate: 1.

The benchmarks with `ForwardDiff.Dual`

A bit of setup, then rerun the same code (including the setup above):

import ForwardDiff
_dual(x) = ForwardDiff.Dual(x, Tuple(randn(5)))
A = _dual.(A);

julia> @benchmark $B \ $A # the naive \
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  115.994 μs …  3.557 ms  ┊ GC (min … max): 0.00% … 84.93%
 Time  (median):     128.175 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   136.982 μs ± 64.630 μs  ┊ GC (mean ± σ):  1.25% ±  2.80%

  ▃█▆▄▅▆▆▄▃▂▃▃▃▂▃▂▂▂▁▁▂▁▁▁▁▁▁▁▁                                ▂
  █████████████████████████████████▆▇▇▆▇▆▆▅▅▅▆▅▆▆▅▅▅▅▆▄▅▄▅▄▅▃▄ █
  116 μs        Histogram: log(frequency) by time       247 μs <

 Memory estimate: 145.84 KiB, allocs estimate: 6.

julia> @benchmark $B_lu \ $A # factorize
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   96.735 μs …  7.046 ms  ┊ GC (min … max): 0.00% … 84.50%
 Time  (median):     111.906 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   121.081 μs ± 90.169 μs  ┊ GC (mean ± σ):  1.71% ±  2.82%

   ▅▆▄▃█▇▄▃▃▃▃▂▂▂▁▂▂▁▁▁▁▁▁▃▃                                   ▂
  ███████████████████████████▇▇▆▇▇▆▆▅▅▅▆▄▅▅▆▆▄▅▆▄▅▃▅▄▁▄▄▅▅▄▅▄▆ █
  96.7 μs       Histogram: log(frequency) by time       251 μs <

 Memory estimate: 126.55 KiB, allocs estimate: 3.

julia> @benchmark $B_s_lu \ $A_s # static lu
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.938 μs … 183.102 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.117 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.752 μs ±   2.930 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▇▄▂▂▁▁▁▁ ▂▃▂▁▂▂▁▁▂▂▁  ▁▁   ▁    ▁          ▂▁             ▂
  ███████████████████████████▇███▇████▇▆▆▆██▇▆▇████▇▆▆▅▅▆▅▅▁▅ █
  8.94 μs      Histogram: log(frequency) by time      14.1 μs <

 Memory estimate: 2.38 KiB, allocs estimate: 1.

julia> @benchmark f($Val(m), $Val(n), C, B_lu) setup = begin C = copy(A) end # f
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  27.163 μs …  2.311 ms  ┊ GC (min … max): 0.00% … 89.31%
 Time  (median):     28.529 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.160 μs ± 23.206 μs  ┊ GC (mean ± σ):  0.68% ±  0.89%

  ▆█▆▇▅▅▇▄▃▄▃▃▄▂▂▂ ▁▁  ▂▁  ▁        ▁  ▁▁▁                    ▂
  ████████████████████████▇███▇██▇▄▆██████▇▇▆▇▆▅▆▆▆▆▇▆▅▄▅▄▅▄▅ █
  27.2 μs      Histogram: log(frequency) by time        46 μs <

 Memory estimate: 13.88 KiB, allocs estimate: 1.

baggepinnen · July 7, 2022, 2:09pm

Can’t you horizontally stack all A and do a single solve?
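
For readers unfamiliar with the idea: a solve acts on each column independently, so right-hand sides that are all known up front can be concatenated and handled in one call. A minimal sketch, where `A1` and `A2` are hypothetical stand-ins for two of the right-hand sides:

```julia
using LinearAlgebra

n, m = 49, 6
B = rand(n, n)
A1 = rand(n, m)          # hypothetical first right-hand side
A2 = rand(n, m)          # hypothetical second right-hand side

B_lu = lu(B)             # factorize once
X = B_lu \ hcat(A1, A2)  # one solve for both blocks of columns

X1 = X[:, 1:m]           # solution block belonging to A1
X2 = X[:, m+1:2m]        # solution block belonging to A2
```

This only applies when every A is available before the first solve.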

Tamas_Papp · July 7, 2022, 2:11pm

Unfortunately not, it is an iterative problem (each iteration does this to calculate something, within a minimizer).

PeterSimon · July 7, 2022, 2:37pm

Check out FastLapackInterface

Edit: Sorry, I initially missed the part where the elements of A may be ForwardDiff.Dual. FastLapackInterface isn’t applicable there.

lmiq · July 7, 2022, 3:29pm

Maybe this helps: Non-allocating matrix inversion - #7 by stevengj

mikmoore · July 7, 2022, 3:35pm

Your 49x49 matrix is probably too big for StaticArrays to be effective. In general, StaticArrays loses most of its performance benefit (and can start to choke your compiler) as vectors/matrices start to grow above 100ish elements.

I think everything you want is covered by ldiv!(B_lu,A). It’s allocation-free and works on ForwardDiff.Dual matrices without modification. Note that this overwrites A with the result. If you want nondestructive solutions, you’ll want to copy A to a buffer (preallocated like Abuf = similar(A), then copy!(Abuf,A) for every call) and then solve in the buffer.
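
A minimal sketch of the buffer pattern described above, assuming the dimensions from the original post. The helper name `solve_into!` is made up for illustration and is not part of LinearAlgebra:

```julia
using LinearAlgebra

n, m = 49, 6
B = rand(n, n)
A = rand(n, m)

B_lu = lu(B)          # factorize once, outside the hot loop
Abuf = similar(A)     # preallocated buffer that will hold the result

# Hypothetical helper: leaves A untouched and writes B \ A into Abuf.
function solve_into!(Abuf, B_lu, A)
    copy!(Abuf, A)        # refresh the buffer with the current right-hand side
    ldiv!(B_lu, Abuf)     # in-place solve; Abuf now holds the solution
    return Abuf
end

A_before = copy(A)                 # kept only to show that A is not modified
X = solve_into!(Abuf, B_lu, A)
```

Inside an iteration, only the `solve_into!` call would be repeated; `B_lu` and `Abuf` are created once.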

Tamas_Papp · July 8, 2022, 7:54am

The benchmarks suggest otherwise. That said, I also find this puzzling since the relevant method (StaticArrays._A_ldiv_B) does not shortcut for “large” matrices, it is still a generated function which is essentially a tape of operations. Maybe it still fits in the CPU cache…

Eric · July 8, 2022, 9:12am

I don’t have a solution, but if I understood correctly you could try to use the sweep operator instead of the \ which is available in two packages:

If you don’t need to compute any extra statistics (like the sum of squares), the package from Joshday will likely be faster. I don’t think any allocation is needed.
I didn’t check how different the results for the QC decomposition are from the sweep operator; I suspect any difference mainly depends on the characteristics of your data (B).
Hope it helps,

Per · July 8, 2022, 10:54am

If it is possible to work with A' instead of A, then you could just reinterpret the dual numbers as scalars to get an A matrix with more columns. Then apply B_lu and reinterpret back.
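
One possible reading of this suggestion, as a rough sketch rather than a tuned implementation. It assumes that B is plain Float64 (so the solve is linear in the value and in each partial separately), that the duals carry 5 partials as in the setup above, and that A is stored transposed in a hypothetical matrix `At`. The Float64 workspace `W` is also an assumption of this sketch:

```julia
using LinearAlgebra
import ForwardDiff

n, m, p = 49, 6, 5      # p = number of partials per dual
B = rand(n, n)
B_lu = lu(B)

# Hypothetical storage choice: keep the transpose, At[j, i] == A[i, j].
At = [ForwardDiff.Dual(randn(), ntuple(_ -> randn(), p)) for j in 1:m, i in 1:n]
A_ref = permutedims(At)  # n×m copy of A, kept only to check the result

R = reinterpret(Float64, At)                # ((p+1)*m) × n Float64 view of the same memory
W = Matrix{Float64}(undef, n, (p + 1) * m)  # preallocated Float64 workspace

W .= transpose(R)       # each value/partial component becomes one Float64 column
ldiv!(B_lu, W)          # in-place solve on plain Float64 data
R .= transpose(W)       # write the solution back into At through the view
X = permutedims(At)     # n×m matrix of duals holding B \ A
```

The two transposing copies are the price for keeping the solve itself on a dense Float64 matrix; whether that pays off would need to be benchmarked.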

mikmoore · July 8, 2022, 2:14pm

Here are my benchmarks, using your definitions from above:

A_ss = SMatrix{n,m}(A) # no need to use MMatrix here
Abuf = similar(A);
@btime \($B_lu,$A); # 3.6us, 1 alloc # allocates new result
@btime ldiv!($B_lu,$A); # 2.8us, 0 alloc # overwrites A with result
@btime ldiv!($B_lu,copy!($Abuf,$A)); # 3.1us, 0 alloc # result saved to Abuf
@btime \($B_s_lu,$A_s); # 7.0us, 1 alloc # allocates new result
@btime \($B_s_lu,$A_ss); # 7.0us, 0 alloc # "allocates" new result

So for me, StaticArrays are 2x slower on this operation. Also, my terminal spins for 5-10s the first time I run a StaticArrays version while it compiles the unrolled version. Note that even the allocating Matrix version of this solve outperforms the StaticArrays versions.

There is no benefit to using an MArray over an SArray except for the ability to explicitly mutate it. The penalty is that operations on an MArray are occasionally slower, since they can be tougher for the compiler to reason about. An SArray is immutable and lives on the stack, so it rarely results in allocations. SArray should be preferred over MArray where possible, which is most places.
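
A small illustration of the difference, using tiny 2×2 matrices chosen only for this example:

```julia
using StaticArrays

s = SMatrix{2,2}(1.0, 2.0, 3.0, 4.0)   # immutable; entries given in column-major order
mm = MMatrix{2,2}(s)                   # mutable copy with the same static size

mm[1, 1] = 10.0                        # allowed: MMatrix supports setindex!
s2 = SMatrix{2,2}(mm)                  # freeze the result back into an immutable SMatrix
```

Here `s` is unchanged by the assignment, since `mm` is a separate copy.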

RoyiAvital · July 8, 2022, 4:16pm

Just to make this thread more searchable for future users (this question arises from time to time), it would be better to add “Solve linear system” to the title. @Tamas_Papp, could you do that?

suavesito · July 9, 2022, 12:54am

I think it is a good suggestion, done.
