Solving a linear system without allocating

First off, this isn’t advanced stuff; conceptually it should be straightforward. I’m almost sure I’m just missing something basic.

So I’m implementing an algorithm that is just a sequence of matrix multiplications. At one point I have S and A, and I am told to compute K = A * inv(S) (the asterisk means ordinary matrix multiplication). The lore is “don’t compute inverses, solve the linear system”. So I’m trying to obtain K by solving the following system:

Solve K * S = A, to obtain K, given S and A.

These things have dimensions:

S (NxN)
K (FxN)
A (FxN)
N > F

I have managed to do this while allocating:

using LinearAlgebra

F = 5
N = 8
# this is an exercise, pretend we don't know K
K = rand(Float64, F, N)
# We know S and A:
S = rand(Float64, N, N)
A = K*S

fac = lu(S)                  # allocates: a copy of S plus a pivot vector
LinearAlgebra.rdiv!(A, fac)  # output gets placed into A - weird but ok
A ≈ K  # true, works!
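
For my own sanity: this works because right division is just left division on the transposed system, i.e. K * S = A is the same as S' * K' = A'. An equivalent (still allocating) sanity check:

K2 = (lu(S') \ copy(A'))'  # solve the transposed system, then transpose back
K2 ≈ K  # true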

Question: how do I make this allocation-free?

I see that lu! exists, which gave me hope that I could reuse the factorization object, for example by replacing fac = lu(S) with fac = lu!(fac, S). However, fac is of type LU, and the only methods of lu! that take an LU are:

 [11] lu!(F::LU{<:Any, <:Tridiagonal}, A::Tridiagonal, pivot::Union{NoPivot, RowMaximum}; check, allowsingular)
     @ ~/julia-1.11.5/share/julia/stdlib/v1.11/LinearAlgebra/src/lu.jl:580
 [12] lu!(F::LU{<:Any, <:Tridiagonal}, A::Tridiagonal; ...)
     @ ~/julia-1.11.5/share/julia/stdlib/v1.11/LinearAlgebra/src/lu.jl:580

But my matrix S isn’t tridiagonal…
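
For what it’s worth, I see that plain lu!(S) (without passing an existing LU) at least factorizes in place, overwriting S, so the only allocations left are the small pivot vector and the LU wrapper, not a second NxN matrix:

fac = lu!(S)                 # S itself now holds the combined L/U factors
LinearAlgebra.rdiv!(A, fac)  # still solves in place into A

But I’d like to get rid of those small allocations too.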

Thank you.


An lu! method for general matrices is added in Julia 1.12:

1.13: define lu!(F::LU,A) for general matrices by longemen3000 · Pull Request #1131 · JuliaLang/LinearAlgebra.jl · GitHub


Do you know if that’s within the v1.12.0-beta4 release? I’d like to try it right now.

It should be in the beta, yes.

GitHub - DynareJulia/FastLapackInterface.jl might also be an option?


I don’t think it is:

# 12 methods for generic function "lu!" from LinearAlgebra:
  [1] lu!(F::LU{<:Any, <:Tridiagonal}, A::Tridiagonal, pivot::Union{NoPivot, RowMaximum}; check, allowsingular)
     @ ~/julia-1.12.0-beta4/share/julia/stdlib/v1.12/LinearAlgebra/src/lu.jl:580
  [2] lu!(F::LU{<:Any, <:Tridiagonal}, A::Tridiagonal; ...)
     @ ~/julia-1.12.0-beta4/share/julia/stdlib/v1.12/LinearAlgebra/src/lu.jl:580
  [3] lu!(A::Tridiagonal{T, V}, pivot::Union{NoPivot, RowMaximum}; check, allowsingular) where {T, V}
     @ ~/julia-1.12.0-beta4/share/julia/stdlib/v1.12/LinearAlgebra/src/lu.jl:569
  [4] lu!(A::StridedMatrix{T}, ::RowMaximum; check, allowsingular) where T<:Union{Float32, Float64, ComplexF64, ComplexF32}
     @ ~/julia-1.12.0-beta4/share/julia/stdlib/v1.12/LinearAlgebra/src/lu.jl:90
  [5] lu!(A::Union{Hermitian{T, S}, Symmetric{T, S}} where S, pivot::Union{NoPivot, RowMaximum, RowNonZero}; check, allowsingular) where T
     @ ~/julia-1.12.0-beta4/share/julia/stdlib/v1.12/LinearAlgebra/src/lu.jl:95
  [6] lu!(A::AbstractMatrix, pivot::Union{NoPivot, RowMaximum, RowNonZero}; check, allowsingular)
     @ ~/julia-1.12.0-beta4/share/julia/stdlib/v1.12/LinearAlgebra/src/lu.jl:151
  [7] lu!(A::Tridiagonal{T, V}; ...) where {T, V}
     @ ~/julia-1.12.0-beta4/share/julia/stdlib/v1.12/LinearAlgebra/src/lu.jl:569
  [8] lu!(A::StridedMatrix{var"#s4709"} where var"#s4709"<:Union{Float32, Float64, ComplexF64, ComplexF32}; check, allowsingular)
     @ ~/julia-1.12.0-beta4/share/julia/stdlib/v1.12/LinearAlgebra/src/lu.jl:89
  [9] lu!(A::Union{Hermitian{T, S}, Symmetric{T, S}} where S; ...) where T
     @ ~/julia-1.12.0-beta4/share/julia/stdlib/v1.12/LinearAlgebra/src/lu.jl:95
 [10] lu!(A::AbstractMatrix; ...)
     @ ~/julia-1.12.0-beta4/share/julia/stdlib/v1.12/LinearAlgebra/src/lu.jl:151
 [11] lu!(A::Union{Hermitian{T, S} where {T, S}, Symmetric{T, S} where {T, S}, Tridiagonal, StridedMatrix}, ::Val{true}; check)
     @ deprecated.jl:213
 [12] lu!(A::Union{Hermitian{T, S} where {T, S}, Symmetric{T, S} where {T, S}, Tridiagonal, StridedMatrix}, ::Val{false}; check)
     @ deprecated.jl:213

But ok, it’s encouraging to see that it’s coming, thank you.

The goal of FastLapackInterface is to eliminate any temporary allocations when using certain LAPACK functions compared to Base Julia.

This sounds like exactly what I want, thank you. I’ll try it.

Ooo lala

using LinearAlgebra, FastLapackInterface

F = 5
N = 8
K = rand(Float64, F, N)
S = rand(Float64, N, N)
A = K*S

linws = LUWs(N);  # preallocated workspace, reusable across factorizations
res = LU(factorize!(linws, S)...)  # careful: this overwrites S with the factors
LinearAlgebra.rdiv!(A, res)
A ≈ K  # true

I’ll take it, thank you!

EDIT: for completeness

@btime res = LU(factorize!($linws, $S)...)
500.886 ns (0 allocations: 0 bytes)

Oh, sorry, looks like it was merged after the feature freeze for 1.12, so we won’t get it until 1.13.

The docstring needs to be fixed: lu!(F, ::Matrix) requires Julia 1.13 by stevengj · Pull Request #1395 · JuliaLang/LinearAlgebra.jl · GitHub

That explains it 🙂 Thank you nonetheless. For now I’ll use the other approach.

Our tests in LinearSolve.jl show it’s pretty slow though 😅. I think there is some issue with how it’s wrapped that I haven’t been able to pin down. I would instead recommend using RecursiveFactorization.jl and passing the pivoting workspace. Or, if you use LinearSolve.jl, then for AppleAccelerate, MKL, and RecursiveFactorization.jl the LinearSolve interface takes care of this and has tests that it is non-allocating.

Those are almost always a better choice than OpenBLAS for LU factorization anyway (it’s chip-dependent, but unless you’re on AMD EPYC, OpenBLAS is generally not a good choice; even AMD Ryzen prefers MKL!), so you might as well fix that choice and hit those libraries directly if you’re already going this far.
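
Roughly, the preallocated-pivot version could look like this (a sketch, assuming the lu!(A, ipiv) method with a user-supplied pivot vector that RecursiveFactorization.jl documents; it returns an ordinary LU, so rdiv! works as before):

using LinearAlgebra
import RecursiveFactorization

N = 8
S = rand(N, N)
A = rand(5, N)
ipiv = Vector{LinearAlgebra.BlasInt}(undef, N)  # reusable pivot workspace

fac = RecursiveFactorization.lu!(S, ipiv)  # factorizes S in place, filling ipiv
LinearAlgebra.rdiv!(A, fac)                # solves in place into A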


One observation: I noticed that the factorize! step in my case takes ~500 ns, as seen above, but the rdiv! step takes ~770 ns. I was surprised, because I thought the factorization step was expected to be the more expensive one…? Not sure if this tells you anything.

Regardless, thanks for the reference, I will try it and report back.

using LinearAlgebra
import FastLapackInterface

using BenchmarkTools
F = 5
N = 8
K = rand(Float64, F, N)
S = rand(Float64, N, N)
A = K*S

linws = FastLapackInterface.LUWs(N)
res = LU(FastLapackInterface.factorize!(linws, S)...)
LinearAlgebra.rdiv!(A, res)
A ≈ K

@btime res = LU(FastLapackInterface.factorize!($linws, $S)...)  # 528.463 ns
@btime LinearAlgebra.rdiv!($A, $res)  # 772.364 ns

The factorization step is more expensive than solving for a single right-hand side, whereas here you have F right-hand sides (or left-hand sides, for right division). The complexity of the factorization is O(N^3), whereas the complexity of F solves is O(N^2 F). Here, N = 8 and F = 5 are pretty close (N^3 = 512 versus N^2 F = 320), so you are in a battle of constant factors.

(The constant factors are also pretty distorted because LAPACK/BLAS is typically optimized mainly for larger matrices.)

If you really care about the 8×8 case, and if the size is fixed in your inner loops (so that you can afford recompiling your code whenever N changes), you could use StaticArrays.jl to go faster for tiny matrices like this:

using StaticArrays, LinearAlgebra, BenchmarkTools
K = @SMatrix rand(5, 8);
S = @SMatrix rand(8, 8);
A = K * S;
S_LU = @btime lu($S);
B = @btime $A / $S_LU;
@show B ≈ K

which gives (on my M4 laptop):

  176.951 ns (0 allocations: 0 bytes)
  88.304 ns (0 allocations: 0 bytes)
B ≈ K = true

Note that we no longer have to worry about “in-place” operations with static arrays, because immutable “allocations” are essentially free.


Hey Steven, thank you again for that. My use case is indeed small matrices, with fixed sizes in inner loops, so your suggestion is totally relevant.

My PC has much worse single-core performance than yours, so I benchmarked both approaches locally.

FastLapackInterface:
497.794 ns (factorize!) / 787.495 ns (rdiv!)

StaticArrays:
484.738 ns (lu) / 196.452 ns (/)

That said, I tried integrating StaticArrays into my code (meaning actual code, not just the benchmark) and I got allocations everywhere, no idea why. The whole thing also ends up awkward, because everywhere else in my code I assume that the matrices are mutable, except these three. So it’s possible I made a mistake somewhere.

As a side question, I wonder if you understand where the speed up is coming from. In my mind, a StaticArray is just an array whose type contains information about the size of the array. I understand how that allows higher performance in situations where you can avoid allocations by putting static arrays on the stack. But in this specific case, where the alternative also doesn’t allocate, where is the performance improvement coming from?

StaticArrays are stack-allocated, and all loops will be unrolled and, if possible, vectorized. This means you can achieve a 5-10x performance improvement for small arrays. Furthermore, if you need mutable arrays, try the MVector and MArray types from StaticArrays. Often this works well.
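
For example, here is a minimal sketch of the mixed approach (the sizes and the function name are just illustrative): keep the matrices in mutable MMatrix buffers for the code that writes into them, and convert to SMatrix only for the solve:

using StaticArrays, LinearAlgebra

S = rand(MMatrix{8,8,Float64})  # mutable, but the size is still in the type
A = rand(MMatrix{5,8,Float64})

function solve_in_place!(A, S)
    # The static solve returns a new immutable matrix, which we copy back
    # into the mutable buffer; in a hot loop where A and S don't escape,
    # this can compile down to allocation-free code.
    A .= SMatrix(A) / lu(SMatrix(S))
    return A
end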

For testing, split your code into functions, and test the innermost function first. Use BenchmarkTools to measure both performance and allocations, and work step by step from the innermost function out to the outer ones.
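
For example (my_kernel! is a stand-in for whatever your innermost function is):

using BenchmarkTools

# @ballocated reports the bytes allocated per call;
# a truly non-allocating kernel should report 0.
@ballocated my_kernel!($A, $S)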
