First off, this isn’t advanced stuff; conceptually it should be straightforward. I’m almost sure I’m just missing something basic.
So I’m implementing an algorithm that is just a sequence of matrix multiplications. At one point I have S and A, and I am told to compute K = A * inv(S). (The asterisk means regular matrix multiplication.) The lore is “don’t compute inverses, solve the linear system”, so I’m trying to obtain K by solving the following system:
Solve K * S = A for K, given S and A.
These things have dimensions:
S (NxN)
K (FxN)
A (FxN)
N > F
I have managed to do this while allocating:
using LinearAlgebra
F = 5
N = 8
# this is an exercise, pretend we don't know K
K = rand(Float64, F, N)
# We know S and A:
S = rand(Float64, N, N)
A = K*S
fac = lu(S)
LinearAlgebra.rdiv!(A, fac) # output gets placed into A - weird but ok
A ≈ K # true, works!
Question, how do I make this allocation-free?
I see that lu! exists, which gave me hope that I could reuse the factorization object, for example by replacing fac = lu(S) with fac = lu!(fac, S). However, fac has type LU, and the only lu! methods that accept an LU are the ones for Tridiagonal matrices, so they don’t apply to a dense S:
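fac = lu(S)
lu!(fac, S) # MethodError: the lu! methods that take an LU are Tridiagonal-only
With FastLapackInterface.jl, however, you can preallocate the workspace and reuse it: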
using LinearAlgebra, FastLapackInterface
F = 5
N = 8
K = rand(Float64, F, N)
S = rand(Float64, N, N)
A = K*S
linws = LUWs(N)
res = LU(factorize!(linws, S)...)
LinearAlgebra.rdiv!(A, res)
A ≈ K # true
Our tests in LinearSolve.jl show it’s pretty slow, though. I think there is some issue with how it’s wrapped that I haven’t been able to pin down. I would instead recommend using RecursiveFactorization.jl and passing in the pivoting workspace. Or, if you use LinearSolve.jl, then for AppleAccelerate, MKL, and RecursiveFactorization.jl the LinearSolve interface takes care of this and has tests that it is non-allocating.
Those are almost always a better choice than OpenBLAS for LU factorization anyway (it’s chip-dependent, but if you’re not on AMD EPYC then OpenBLAS is generally not a good choice; even AMD Ryzen prefers MKL!), so you might as well fix that choice and hit them directly if you’re already going this far.
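For reference, here is a minimal sketch of that suggestion; I’m assuming RecursiveFactorization.lu! accepts a preallocated pivot vector (check the package docs for the exact signature):
using LinearAlgebra
import RecursiveFactorization
F = 5
N = 8
K = rand(Float64, F, N)
S = rand(Float64, N, N)
A = K*S
ipiv = Vector{LinearAlgebra.BlasInt}(undef, N) # preallocated pivot workspace, reused across calls
fac = RecursiveFactorization.lu!(S, ipiv) # assumed API: factorize S in place, reusing ipiv
LinearAlgebra.rdiv!(A, fac) # A now holds K
A ≈ K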
One observation: the factorize! step in my case takes about 500 ns, but the rdiv! step takes about 770 ns (see the benchmark below). I was surprised because I thought the factorization step was expected to be more expensive…? Not sure if this tells you anything.
Regardless, thanks for the reference, I will try it and report back.
using LinearAlgebra
import FastLapackInterface
using BenchmarkTools
F = 5
N = 8
K = rand(Float64, F, N)
S = rand(Float64, N, N)
A = K*S
linws = FastLapackInterface.LUWs(N)
res = LU(FastLapackInterface.factorize!(linws, S)...)
LinearAlgebra.rdiv!(A, res)
A ≈ K
@btime res = LU(FastLapackInterface.factorize!($linws, $S)...) # 528.463ns
@btime LinearAlgebra.rdiv!($A, $res) # 772.364ns
The factorization step is more expensive than solving a single right-hand side, whereas here you have F right-hand sides (or left-hand-sides for right-division). The complexity of the factorization is O(N^3), whereas the complexity of F solves is O(N^2 F). Here, N=8 and F=5 are pretty close, so that you are in a battle of constant factors.
(The constant factors are also pretty distorted because LAPACK/BLAS is typically optimized mainly for larger matrices.)
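Concretely, with textbook operation counts (the actual library constants differ): the LU factorization costs roughly (2/3)·N³ ≈ 341 flops for N = 8, while the two triangular solves per right-hand side cost roughly 2·N²·F = 640 flops for F = 5. So at these sizes the solve step can easily do more arithmetic than the factorization, consistent with your timings.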
If you really care about the 8×8 case, and if the size is fixed in your inner loops (so that you can afford recompiling your code whenever N changes), you could use StaticArrays.jl to go faster for tiny matrices like this:
using StaticArrays, LinearAlgebra, BenchmarkTools
K = @SMatrix rand(5, 8);
S = @SMatrix rand(8, 8);
A = K * S;
S_LU = @btime lu($S);
B = @btime $A / $S_LU;
@show B ≈ K
which gives (on my M4 laptop):
176.951 ns (0 allocations: 0 bytes)
88.304 ns (0 allocations: 0 bytes)
B ≈ K = true
Note that we no longer have to worry about “in-place” operations with static arrays, because immutable “allocations” are essentially free.
Hey Steven, thank you again for that. My use case is indeed small matrices, with sizes fixed in the inner loops, so your suggestion is totally relevant.
My PC has much worse single core performance than yours, so I benchmarked both approaches on my PC.
FastLapackInterface (factorize / solve):
497.794 ns / 787.495 ns
StaticArrays (lu / division):
484.738 ns / 196.452 ns
That said, I tried integrating StaticArrays into my code (meaning, my actual code, not just the benchmark) and I got allocations everywhere, and I have no idea why. The whole thing also ends up awkward, because everywhere in my code I assume that the matrices are mutable, except these three, so it’s possible I made a mistake somewhere.
As a side question, I wonder if you understand where the speedup is coming from. In my mind, a StaticArray is just an array whose type carries information about the array’s size. I understand how that allows higher performance in situations where you can avoid allocations by putting static arrays on the stack. But in this specific case, where the alternative also doesn’t allocate, where is the performance improvement coming from?
StaticArrays are stack-allocated, and all loops are unrolled and, if possible, vectorized. This means you can achieve a 5–10× performance improvement for small arrays. Furthermore, if you need mutable arrays, try the MVector and MMatrix types from StaticArrays; this often works well (see the sketch below).
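For instance, a minimal sketch of that route, assuming the same 5×8 and 8×8 sizes as above:
using StaticArrays, LinearAlgebra
A = @SMatrix rand(5, 8)
S = @SMatrix rand(8, 8)
K = MMatrix{5, 8, Float64}(undef) # mutable fixed-size matrix, a reusable buffer
K .= A / lu(S) # broadcast-assign the solve result into the existing buffer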
For testing, split your code into functions, and test the innermost function first. Use BenchmarkTools to check the performance and the allocations, and go step by step from the innermost function to the outer functions.
My own methodology for spotting allocations is that I literally add, inside the hot loop:
line1
println(@allocated begin
    line2
end)
line3
If it prints all zeros, then that line doesn’t allocate; otherwise it does. Adding prints might sound primitive, but bear in mind it’s not my first rodeo with Julia, and this is by far the most reliable way I’ve found. In fact, that’s how I got the hot loop to exactly zero allocations in the non-static-arrays branch.
So back to our concrete example. The lines in the hot loop are, schematically (the same steps that are split out further below):
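S_LU = lu(mystruct.S)
mystruct.K = mystruct.A / S_LU # solve and assign in one line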
and this allocates. Weird! A, S, and K are all StaticArrays; how can that allocate? So I double-checked their types and printed some values so that I could reproduce the allocation in isolation, with a standalone snippet along these lines:
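using StaticArrays, LinearAlgebra
A = @SMatrix rand(5, 8) # placeholder values with the same static types as my fields
S = @SMatrix rand(8, 8)
S_LU = lu(S)
println(@allocated begin
    K = A / S_LU
end) # prints 0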
So, standalone, this doesn’t allocate. But it allocates inside the hot loop… So I think: maybe it’s not the solving, but the assignment somehow. So I split the solving and the assignment into separate lines (and threw in a type assert just in case). Inside the loop:
S_LU = lu(mystruct.S)
println(@allocated begin
    K::SMatrix{5, 8, Float64, 40} = mystruct.A / S_LU
end)
mystruct.K = K
and this still allocates.
In my view this is an example of “code that doesn’t allocate on its own, but does when there’s other stuff around it”, which is why I prefer to insert @allocated prints inside the hot loop instead of measuring allocations from benchmarks.