First off, this isn’t advanced stuff; conceptually it should be straightforward. I’m almost sure I’m just missing something basic.
So I’m implementing an algorithm that is just a sequence of matrix multiplications. At one point I have S and A, and I am told to compute K = A * inv(S). (The asterisk means regular matrix multiplication.) The lore is “don’t compute inverses, solve the linear system”, so I’m trying to obtain K by solving the following system:
Solve K * S = A for K, given S and A.
These things have dimensions:
S (NxN)
K (FxN)
A (FxN)
N > F
I have managed to do this while allocating:
using LinearAlgebra
F = 5
N = 8
# this is an exercise, pretend we don't know K
K = rand(Float64, F, N)
# We know S and A:
S = rand(Float64, N, N)
A = K*S
fac = lu(S)
LinearAlgebra.rdiv!(A, fac) # output gets placed into A - weird but ok
A ≈ K # true, works!
Question, how do I make this allocation-free?
I see that lu! exists, which gave me hope that I could reuse the factorization object, for example by replacing fac = lu(S) with fac = lu!(fac, S). However, fac has type LU, and the only lu! methods that accept an LU are the ones for Tridiagonal matrices, so they don’t apply to a dense S:
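fac = lu(S)
lu!(fac, S) # MethodError: the lu! methods that take an LU are Tridiagonal-only
With FastLapackInterface.jl, however, you can preallocate the workspace and reuse it: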
using LinearAlgebra, FastLapackInterface
F = 5
N = 8
K = rand(Float64, F, N)
S = rand(Float64, N, N)
A = K*S
linws = LUWs(N)
res = LU(factorize!(linws, S)...)
LinearAlgebra.rdiv!(A, res)
A ≈ K # true
Our tests in LinearSolve.jl show it’s pretty slow, though. I think there is some issue with how it’s wrapped that I haven’t been able to pin down. I would instead recommend using RecursiveFactorization.jl and passing in the pivoting workspace. Or, if you use LinearSolve.jl, then for AppleAccelerate, MKL, and RecursiveFactorization.jl the LinearSolve interface takes care of this and has tests that it is non-allocating.
Those are almost always a better choice than OpenBLAS for LU factorization anyway (it’s chip-dependent, but if you’re not on AMD EPYC then OpenBLAS is generally not a good choice; even AMD Ryzen prefers MKL!), so you might as well fix that choice and hit them directly if you’re already going this far.
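For reference, here is a minimal sketch of that suggestion; I’m assuming RecursiveFactorization.lu! accepts a preallocated pivot vector (check the package docs for the exact signature):
using LinearAlgebra
import RecursiveFactorization
F = 5
N = 8
K = rand(Float64, F, N)
S = rand(Float64, N, N)
A = K*S
ipiv = Vector{LinearAlgebra.BlasInt}(undef, N) # preallocated pivot workspace, reused across calls
fac = RecursiveFactorization.lu!(S, ipiv) # assumed API: factorize S in place, reusing ipiv
LinearAlgebra.rdiv!(A, fac) # A now holds K
A ≈ K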
One observation: the factorize! step in my case takes about 500 ns, but the rdiv! step takes about 770 ns (see the benchmark below). I was surprised because I thought the factorization step was expected to be more expensive…? Not sure if this tells you anything.
Regardless, thanks for the reference, I will try it and report back.
using LinearAlgebra
import FastLapackInterface
using BenchmarkTools
F = 5
N = 8
K = rand(Float64, F, N)
S = rand(Float64, N, N)
A = K*S
linws = FastLapackInterface.LUWs(N)
res = LU(FastLapackInterface.factorize!(linws, S)...)
LinearAlgebra.rdiv!(A, res)
A ≈ K
@btime res = LU(FastLapackInterface.factorize!($linws, $S)...) # 528.463ns
@btime LinearAlgebra.rdiv!($A, $res) # 772.364ns
The factorization step is more expensive than solving a single right-hand side, whereas here you have F right-hand sides (or left-hand-sides for right-division). The complexity of the factorization is O(N^3), whereas the complexity of F solves is O(N^2 F). Here, N=8 and F=5 are pretty close, so that you are in a battle of constant factors.
(The constant factors are also pretty distorted because LAPACK/BLAS is typically optimized mainly for larger matrices.)
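Concretely, with textbook operation counts (the actual library constants differ): the LU factorization costs roughly (2/3)·N³ ≈ 341 flops for N = 8, while the two triangular solves per right-hand side cost roughly 2·N²·F = 640 flops for F = 5. So at these sizes the solve step can easily do more arithmetic than the factorization, consistent with your timings.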
If you really care about the 8×8 case, and if the size is fixed in your inner loops (so that you can afford recompiling your code whenever N changes), you could use StaticArrays.jl to go faster for tiny matrices like this:
using StaticArrays, LinearAlgebra, BenchmarkTools
K = @SMatrix rand(5, 8);
S = @SMatrix rand(8, 8);
A = K * S;
S_LU = @btime lu($S);
B = @btime $A / $S_LU;
@show B ≈ K
which gives (on my M4 laptop):
176.951 ns (0 allocations: 0 bytes)
88.304 ns (0 allocations: 0 bytes)
B ≈ K = true
Note that we no longer have to worry about “in-place” operations with static arrays, because immutable “allocations” are essentially free.
Hey Steven, thank you again for that. My use case is indeed small matrices, with sizes fixed in the inner loops, so your suggestion is totally relevant.
My PC has much worse single core performance than yours, so I benchmarked both approaches on my PC.
FastLapackInterface (factorize / solve):
497.794 ns / 787.495 ns
StaticArrays (lu / division):
484.738 ns / 196.452 ns
That said, I tried integrating StaticArrays into my code (meaning, my actual code, not just the benchmark) and I got allocations everywhere, and I have no idea why. The whole thing also ends up awkward, because everywhere in my code I assume that the matrices are mutable, except these three, so it’s possible I made a mistake somewhere.
As a side question, I wonder if you understand where the speedup is coming from. In my mind, a StaticArray is just an array whose type carries information about the array’s size. I understand how that allows higher performance in situations where you can avoid allocations by putting static arrays on the stack. But in this specific case, where the alternative also doesn’t allocate, where is the performance improvement coming from?
StaticArrays are stack-allocated, and all loops are unrolled and, if possible, vectorized. This means you can achieve a 5–10× performance improvement for small arrays. Furthermore, if you need mutable arrays, try the MVector and MMatrix types from StaticArrays; this often works well (see the sketch below).
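For instance, a minimal sketch of that route, assuming the same 5×8 and 8×8 sizes as above:
using StaticArrays, LinearAlgebra
A = @SMatrix rand(5, 8)
S = @SMatrix rand(8, 8)
K = MMatrix{5, 8, Float64}(undef) # mutable fixed-size matrix, a reusable buffer
K .= A / lu(S) # broadcast-assign the solve result into the existing buffer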
For testing, split your code into functions, and test the innermost function first. Use BenchmarkTools to check the performance and the allocations, and go step by step from the innermost function to the outer functions.
My own methodology for spotting allocations is that I literally add, inside the hot loop:
line1
println(@allocated begin
    line2
end)
line3
If it prints all zeros, then that line doesn’t allocate; otherwise it does. Adding prints might sound primitive, but bear in mind it’s not my first rodeo with Julia, and this is by far the most reliable way I’ve found. In fact, that’s how I got the hot loop to exactly zero allocations in the non-static-arrays branch.
So back to our concrete example. The lines in the hot loop are, schematically (the same steps that are split out further below):
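S_LU = lu(mystruct.S)
mystruct.K = mystruct.A / S_LU # solve and assign in one line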
and this allocates. Weird! A, S, and K are all StaticArrays; how can that allocate? So I double-checked their types and printed some values so that I could reproduce the allocation in isolation, with a standalone snippet along these lines:
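using StaticArrays, LinearAlgebra
A = @SMatrix rand(5, 8) # placeholder values with the same static types as my fields
S = @SMatrix rand(8, 8)
S_LU = lu(S)
println(@allocated begin
    K = A / S_LU
end) # prints 0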
So, standalone, this doesn’t allocate. But it allocates inside the hot loop… So I think: maybe it’s not the solving, but the assignment somehow. So I split the solving and the assignment into separate lines (and threw in a type assert just in case). Inside the loop:
S_LU = lu(mystruct.S)
println(@allocated begin
    K::SMatrix{5, 8, Float64, 40} = mystruct.A / S_LU
end)
mystruct.K = K
and this still allocates.
In my view this is an example of “code that doesn’t allocate on its own, but does when there’s other stuff around it”, which is why I prefer to insert @allocated prints inside the hot loop instead of measuring allocations from benchmarks.