Solving linear systems without allocations, a comparison

I wanted to see how important removing allocations was when repeatedly solving linear systems (Ax=b), so I built a quick test package to try things out: RobertGregg/QRLite.jl on GitHub. The interface uses FastLapackInterface.jl to store workspaces for the LAPACK calls, and the rest is just making sure I arrive at the same solution as the backslash operator (e.g., using pivoted QR). Below is an example of solving a system where A and B change at the same time:

using QRLite, BenchmarkTools

n = 50
k = 5

#A is not square to avoid \ from using LU instead of QR
A = rand(2n,n)
B = rand(2n,k)

linsolve = CachedQR(A,B)
x = solution(linsolve)

x ≈ A \ B #true

Anew = rand(2n,n)
Bnew = rand(2n,k)

#Change A and B at the same time
updateQR!(linsolve, Anew, Bnew)

solution(linsolve) ≈ Anew \ Bnew #true

#Check for allocations
@benchmark updateQR!($linsolve, $Anew, $Bnew)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.300 μs … 541.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   54.192 μs ±  10.592 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █   ▃ █
  ██▅▄▇█▇█▆█▄▇█▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂ ▃
  49.3 μs         Histogram: frequency by time         83.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Everything seems to be working. However, what I found surprising was what happened with the benchmark comparisons:

Avoiding allocations showed a decent speedup for small systems (e.g., A of size 20 × 10), but once you get past a few hundred rows and columns the timings are nearly identical. This was also the case when more columns were added to B. From this, I would conclude that avoiding allocations really only matters if you’re solving a huge number of small systems.
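A rough stdlib-only stand-in for that comparison (not the package benchmark itself): repeatedly factorizing with a fresh allocation each time (`lu`) versus reusing one scratch matrix (`lu!`). The `bench` helper, sizes, and repetition count here are all illustrative.

```julia
using LinearAlgebra

# Compare repeated factorization with allocation (lu copies A internally)
# against reusing a single preallocated scratch matrix (lu! in place).
function bench(n; reps = 200)
    A = rand(n, n) + n * I      # diagonally shifted => well-conditioned
    Awork = similar(A)          # preallocated scratch, reused every iteration
    lu(A); lu!(copyto!(Awork, A))   # warm up / compile
    t_alloc = @elapsed for _ in 1:reps; lu(A); end
    t_reuse = @elapsed for _ in 1:reps; lu!(copyto!(Awork, A)); end
    return t_alloc, t_reuse
end

println("n = 10:  ", bench(10))
println("n = 500: ", bench(500))
```

At small n the allocation-free loop should pull ahead; at a few hundred rows the O(n^3) factorization dominates and the two converge, matching the observation above.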

If people think this kind of package would be helpful, I could polish it up and register it. 🙂

5 Likes

If you’re interested in independent small or mid-sized problems, distributing them over multiple threads is probably best (using BLAS.set_num_threads(1) to start). Without pre-allocation, there may be contention for access to memory management, so you may see bigger differences.

It would be interesting to see if recent work on garbage collection has ameliorated this; so if you build multi-threaded tests, I suggest trying them on several versions of Julia. Make sure each worker gets separate resources for the pre-allocated scheme, and bundle tiny problems into good task sizes.
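A sketch of that setup using only the standard library: each task gets its own scratch matrix, and the tiny problems are bundled into chunks (the helper name `solve_batch!` and the chunk size 64 are made up here). It uses the in-place `lu!`/`ldiv!` path for square systems rather than the QR path from the original post.

```julia
using LinearAlgebra
using Base.Threads

BLAS.set_num_threads(1)   # keep BLAS threads from fighting Julia threads

# Solve many independent small systems in parallel; each chunk of work
# owns a private scratch matrix, so tasks never share mutable buffers.
function solve_batch!(Xs, As, bs)
    @threads for chunk in collect(Iterators.partition(eachindex(As), 64))
        Awork = similar(first(As))           # this task's private scratch
        for i in chunk
            F = lu!(copyto!(Awork, As[i]))   # in-place LU, reuses Awork
            ldiv!(Xs[i], F, bs[i])           # solve into preallocated Xs[i]
        end
    end
    return Xs
end

n, m = 20, 1_000
As = [rand(n, n) + n * I for _ in 1:m]   # well-conditioned test matrices
bs = [rand(n) for _ in 1:m]
Xs = [zeros(n) for _ in 1:m]
solve_batch!(Xs, As, bs)
```

Running this under different Julia versions (and with/without the preallocated scratch) would show how much GC contention the threads actually see.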

2 Likes

I don’t have a huge need to solve small- to mid-sized problems, but I think combining this method with multi-threading for that purpose could offer significant speedups. It might be nice to “subtract out” the time it takes to solve the linear systems, to focus on how much time you’re actually saving between allocating and not.

I think a good way to test multiple Julia versions would be through GitHub Actions?
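Something like the following workflow matrix might work (the script path `bench/runbenchmarks.jl` and the version list are placeholders):

```yaml
name: benchmarks
on: [push]
jobs:
  bench:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        julia-version: ['1.9', '1.10', '1.11']
    steps:
      - uses: actions/checkout@v4
      - uses: julia-actions/setup-julia@v2
        with:
          version: ${{ matrix.julia-version }}
      - run: julia --project -e 'using Pkg; Pkg.instantiate(); include("bench/runbenchmarks.jl")'
```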

Realize that the cost of these calculations is O(n^3), whereas the cost of allocations and garbage collection is roughly O(n^2) for large n (i.e., linear in the amount of memory being allocated) and closer to O(1) for small allocations. So the overhead of allocations quickly becomes irrelevant as n grows.
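To put rough numbers on that, a stdlib-only sketch (sizes arbitrary; single-run `@elapsed` timings are noisy but the orders of magnitude are what matter):

```julia
using LinearAlgebra

# Time one least-squares solve (O(n^3) work) vs. allocating the
# same-sized arrays it needs (O(n^2) memory).
function solve_vs_alloc(n)
    A, b = rand(2n, n), rand(2n)
    A \ b                                    # warm up / compile
    t_solve = @elapsed A \ b
    t_alloc = @elapsed (similar(A), similar(b))
    return t_solve, t_alloc
end

t_s10, t_a10   = solve_vs_alloc(10)
t_s500, t_a500 = solve_vs_alloc(500)
println("allocation/solve ratio, n = 10:  ", t_a10 / t_s10)
println("allocation/solve ratio, n = 500: ", t_a500 / t_s500)
```

The allocation fraction should shrink rapidly with n, which is exactly the crossover seen in the benchmark above.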

Similarly for the advantage of having things in-cache, which is another benefit of pre-allocating arrays. The cost of loading an array into cache is linear in the size of the data. Eventually, O(n^3) swamps this. This is also related to the reason why (carefully arranged) dense linear algebra is ultimately CPU bound (hitting nearly the peak flop rate for e.g. matrix–matrix multiplies) rather than being memory-bound.

Here is a little plot of the memory-allocation time @belapsed Vector{Float64}(undef, $n) as a function of n (for Julia 1.12.5 on an Apple M4):

(You can see it switching between different memory-management algorithms depending on n.)
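For reference, a stdlib-only version of that measurement (the plot used BenchmarkTools’ `@belapsed`; a crude minimum-over-repetitions timer stands in for it here, and the size range is illustrative):

```julia
# Best-of-many-runs allocation timing, a rough stand-in for
# @belapsed Vector{Float64}(undef, $n).
function alloc_time(n; reps = 200)
    best = Inf
    for _ in 1:reps
        t = @elapsed Vector{Float64}(undef, n)
        best = min(best, t)
    end
    return best
end

ns = [2^k for k in 4:2:20]
ts = [alloc_time(n) for n in ns]
foreach((n, t) -> println("n = $n: $t s"), ns, ts)
```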

If you have lots of really small matrices, e.g. n ≲ 10, then it becomes worthwhile to start thinking about things like StaticArrays.jl.
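For example, with StaticArrays.jl a tiny solve lives entirely on the stack (a minimal sketch; the values are arbitrary):

```julia
using StaticArrays

# Fixed-size arrays: the 2×2 solve below is stack-allocated, with the
# size known to the compiler, so no heap allocation or GC pressure.
A = SA[4.0 1.0; 1.0 3.0]
b = SA[1.0, 2.0]
x = A \ b     # returns an SVector; no heap allocation
```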

4 Likes

People who only need to solve a system of a given size once can just use \. People who need to repeatedly solve systems of the same size can learn to use FastLapackInterface.jl. There doesn’t seem to be a need for a dedicated package just for this purpose.

1 Like

Both good points by @stevengj and @Norman. There is a very small gap between what something like StaticArrays would cover and where the time is dominated by the large matrix multiplications (although StaticArrays will allocate in some instances). This probably doesn’t need a dedicated package, but it’s nice to know it’s possible.

I would think that some kind of hypothetical Base.unsafe_free!(mem::Memory) would cover some more of the gap.

Typically, your allocator (e.g. glibc ptmalloc malloc+free, and your OS kernel for large allocs that go via mmap) already has things like thread-local pools for rapid reuse. It is somewhat typical to have naive C code that mallocs temp storage for a matrix, uses it, frees it, and then loops: APIs where you have to pre-allocate reusable buffers are slightly inconvenient.

Unfortunately, this kind of rapid-reuse caching does not work very well with garbage collection: the freeing happens nondeterministically and much later, causing a false overlap of lifetimes.

Languages like Java/C# have a harder time doing that, because they want to be “memory safe per policy/compliance”. Julia does not have the design goal of being “certifiably” memory safe, so we could support that.

1 Like