Solving linear systems without allocations, a comparison

I wanted to see how important removing allocations were when repeatedly solving linear systems (Ax=b), so I built a quick test package to try things out: GitHub - RobertGregg/QRLite.jl · GitHub. The interface uses FastLapackInterface.jl to store work-spaces for the LAPACK calls and the rest is just making sure I arrive at the same solution as the backslash operator (e.g., using pivoted QR). Below is an example of solving a system were A and b change at the same time:

using QRLite, BenchmarkTools

n = 50
k = 5

#A is not square to avoid \ from using LU instead of QR
A = rand(2n,n)
B = rand(2n,k)

linsolve = CachedQR(A,B)
x = solution(linsolve)

x ≈ A \ B #true

Anew = rand(2n,n)
Bnew = rand(2n,k)

#Change A and B at the same time
updateQR!(linsolve, Anew, Bnew)

solution(linsolve) ≈ Anew \ Bnew #true

#Check for allocations
@benchmark updateQR!($linsolve, $Anew, $Bnew)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.300 μs … 541.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   54.192 μs ±  10.592 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █   ▃ █
  ██▅▄▇█▇█▆█▄▇█▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂ ▃
  49.3 μs         Histogram: frequency by time         83.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Everything seems to be working. However, what I found surprising was what happened with the benchmark comparisons:

Avoiding allocations showed a decent speedup for small systems (e.g., A size = 20 by 10), but once you get past a few hundred rows and columns, the timings are nearly identical. This was also the case when more columns were added to b. From this, I would conclude avoiding allocations is really only important if you’re solving a huge number of small systems.

If people think this kind of package would be helpful I could polish it up and register it. :slightly_smiling_face: