Non-allocating matrix inversion

Is there any way to perform matrix inversion?

Say I have a Vector of Matrix where I want to perform the inverse, and then save them in the same spot where they came from in the array. I had thought of this:

function bucle!(Zd)
    @simd for i in eachindex(Zd)
        @inbounds Zd[i] = inv(Zd[i])
    return Zd

But it allocates:

julia> @btime bucle!($Zd)
  1.125 ms (4004 allocations: 2.34 MiB)


For a single matrix this could be:

julia> function inv!(m)
          m .= inv(m)
inv! (generic function with 1 method)

But inv already allocates:

julia> @btime inv($m)
  838.983 ns (4 allocations: 1.95 KiB)

julia> @btime inv!($m)
  812.727 ns (4 allocations: 1.95 KiB)

Your function would be than:

julia> function bucle!(Zd)
           @simd for i in eachindex(Zd)
               @inbounds inv!(Zd[i])
           return Zd

julia> Zd=[ rand(Float64,3,3), rand(Float64,3,3), rand(Float64,3,3) ]

julia> @btime bucle!($Zd)
  2.511 μs (12 allocations: 5.86 KiB)

If your matrices are small (i.e. probably less than 10x10), then you can use to make your original function faster and non-allocating with no changes at all to the function itself:

julia> using BenchmarkTools

julia> using StaticArrays

julia> function bucle!(Zd)
           @simd for i in eachindex(Zd)
               @inbounds Zd[i] = inv(Zd[i])
           return Zd
bucle! (generic function with 1 method)

julia> Zd = [rand(SMatrix{3, 3, Float64}) for _ in 1:3];

julia> @btime bucle!($Zd);
  25.224 ns (0 allocations: 0 bytes)

On my machine, using SMatrix as the element of Zd results in a 50X speedup over plain Matrix for a 3x3 input (expect this improvement to decrease as the matrices get larger and eventually go away completely for large enough matrices).


Am I missing something or your code is doing the element-wise reciprocal, and not matrix inversion, that you could do allocation-free with

julia> bucle!(m) = m .= inv.(m)
bucle! (generic function with 1 method)

julia> @btime bucle!(m) setup = (m = collect(reshape(1.0:100.0, 10, 10)))
  368.077 ns (0 allocations: 0 bytes)
10×10 Matrix{Float64}:
 1.0       0.0909091  0.047619   0.0322581  0.0243902  0.0196078  0.0163934  0.0140845  0.0123457  0.010989
 0.5       0.0833333  0.0454545  0.03125    0.0238095  0.0192308  0.016129   0.0138889  0.0121951  0.0108696
 0.333333  0.0769231  0.0434783  0.030303   0.0232558  0.0188679  0.015873   0.0136986  0.0120482  0.0107527
 0.25      0.0714286  0.0416667  0.0294118  0.0227273  0.0185185  0.015625   0.0135135  0.0119048  0.0106383
 0.2       0.0666667  0.04       0.0285714  0.0222222  0.0181818  0.0153846  0.0133333  0.0117647  0.0105263
 0.166667  0.0625     0.0384615  0.0277778  0.0217391  0.0178571  0.0151515  0.0131579  0.0116279  0.0104167
 0.142857  0.0588235  0.037037   0.027027   0.0212766  0.0175439  0.0149254  0.012987   0.0114943  0.0103093
 0.125     0.0555556  0.0357143  0.0263158  0.0208333  0.0172414  0.0147059  0.0128205  0.0113636  0.0102041
 0.111111  0.0526316  0.0344828  0.025641   0.0204082  0.0169492  0.0144928  0.0126582  0.011236   0.010101
 0.1       0.05       0.0333333  0.025      0.02       0.0166667  0.0142857  0.0125     0.0111111  0.01


Numerically speaking, you never actually want to calculate matrix inverses. The only thing a matrix inverse is good for is solving Ax=b for x, which can be done more efficiently and stably by QR decomposition and gauss elimination (and in specific cases even better, specialized techniques).

The inverse is only really useful for symbolic reasoning.

If in doubt, do x = A\b rather than x = inv(A)*b


Expanding that a bit: If you want to solve small linear systems, use StaticArrays.jl. If you want to solve bigger linear systems, look for factorize!. Otherwise, it would be good to know more about your application to help you.

1 Like

There is actually an undocumented LinearAlgebra.inv! function that acts in-place on an LU factorization, which can be constructed in-place with lu!:

inv!(A) = LinearAlgebra.inv!(lu!(A))

However, it still allocates some small 1d arrays internally:

julia> A = rand(1000,1000);

julia> @btime inv($A);
  30.062 ms (5 allocations: 8.13 MiB)

julia> @btime inv!($A);
  28.390 ms (3 allocations: 508.09 KiB)

To completely eliminate all allocations, you need to pre-allocate the low-level workspace vectors and call lower-level LAPACK functions (based on code for getrf! and getri!).

import Base: require_one_based_indexing
import LinearAlgebra: checksquare, chkstride1
import LinearAlgebra.BLAS: @blasfunc, BlasInt
import LinearAlgebra.LAPACK: chkargsok, chklapackerror, liblapack, getrf!, getri!

for (getrf, getri, elty) in ((:dgetrf_,:dgetri_,:Float64), (:sgetrf_,:sgetri_,:Float32), (:zgetrf_,:zgetri_,:ComplexF64), (:cgetrf_,:cgetri_,:ComplexF32))
    @eval function getrf!(A::AbstractMatrix{$elty}, ipiv::AbstractVector{BlasInt})
        m, n = size(A)
        lda  = max(1,stride(A, 2))
        length(ipiv) ≥ min(m,n) || throw(DimensionMismatch())
        info = Ref{BlasInt}()
        ccall((@blasfunc($getrf), liblapack), Cvoid,
              (Ref{BlasInt}, Ref{BlasInt}, Ptr{$elty},
               Ref{BlasInt}, Ptr{BlasInt}, Ptr{BlasInt}),
              m, n, A, lda, ipiv, info)
        A, ipiv, info[] #Error code is stored in LU factorization type
    @eval function getri!(A::AbstractMatrix{$elty}, ipiv::AbstractVector{BlasInt}, work::AbstractVector{$elty})
        require_one_based_indexing(A, ipiv)
        chkstride1(A, ipiv)
        n = checksquare(A)
        if n != length(ipiv)
            throw(DimensionMismatch("ipiv has length $(length(ipiv)), but needs $n"))
        lda = max(1,stride(A, 2))
        lwork = BlasInt(length(work))
        lwork ≥ n || throw(DimensionMismatch())
        info  = Ref{BlasInt}()
        ccall((@blasfunc($getri), liblapack), Cvoid,
              (Ref{BlasInt}, Ptr{$elty}, Ref{BlasInt}, Ptr{BlasInt},
               Ptr{$elty}, Ref{BlasInt}, Ptr{BlasInt}),
              n, A, lda, ipiv, work, lwork, info)

struct InvWorkspace{T}

function InvWorkspace(A::AbstractMatrix{T}) where {T}
    n = checksquare(A)
    ipiv = Vector{BlasInt}(undef, n)
    work = Vector{T}(undef, 64*n)
    return InvWorkspace{T}(ipiv, work)

function inv!(A::AbstractMatrix{T}, W::InvWorkspace{T}) where {T}
    getrf!(A, W.ipiv)
    return getri!(A, W.ipiv,

which gives

julia> W = InvWorkspace(A);

julia> @btime inv!($A, $W);
  28.423 ms (0 allocations: 0 bytes)
using LinearAlgebra, BenchmarkTools, BenchmarkPlots, StatsPlots
bg = BenchmarkGroup()
bg[:inv] = @benchmarkable A*inv(B)*b setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:backslash] = @benchmarkable A*(B\b) setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:slash] = @benchmarkable (A/B)*b setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:qr] = @benchmarkable A*(qr(B)\b) setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:lu] = @benchmarkable A*(lu(B)\b) setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:inv_pre] = @benchmarkable A*Q*b setup=(A=rand(100,100);B=rand(100,100);Q=inv(B);b=rand(100))
bg[:qr_pre] = @benchmarkable A*(Q\b) setup=(A=rand(100,100);B=rand(100,100);Q=qr(B);b=rand(100))
bg[:lu_pre] = @benchmarkable A*(Q\b) setup=(A=rand(100,100);B=rand(100,100);Q=lu(B);b=rand(100))
trial = run(bg)
plot(trial; yscale=:log10)


Even ignoring the numerical instability of the inversion process, you can see that A*(B\b) is faster than inversion, and that qr(B) is faster when used as a precomputed auxiliary.

Edit: Added in LU decomposition, because it’s the preferred algorithm.


Yes, the standard caveat applies that you are normally better off not using matrix inverses at all, but rather computing the factorization object (typically LU factorization for solving linear systems).

1 Like

Yep, I think that using StaticArrays for my problem will be the wisest. All of my matrices are 2x2 or 4x4.

But I have a question: why do you use SMatrix instead of MMatrix? Isn’t it mutable, since you are reusing the space?

Thanks for the insight. I am aware of it, but unfortunately there is no way around this. I need sometimes to plot each of the elements of the inverse matrix.

There is no multiplication afterwards, so I get no benefit from that. :wink:

Whoa, this is deep level stuff haha. Thanks for the info!

You don’t need to overwrite individual elements of the matrix — you are overwriting the whole matrix with its inverse. So just SMatrix is fine. (The temporary intermediate calculations are done in registers or on the stack, so no need to worry about allocations there.)

I think that using StaticArrays for my problem will be the wisest. All of my matrices are 2x2 or 4x4.

Yes, in this case they will be an order of magnitude (at least!) faster than calling out to LAPACK.


Perhaps it’s worth adding to this thread that if your matrix is symmetric/hermitian positive definite, you can compute an in-place inverse with zero (pre-)allocations through a Cholesky decomposition. The undocumented LinearAlgebra.inv!(cholesky!(A)) gets you most of the way there, but to get every last ounce of performance you need to use the LAPACK API as shown below (in both cases you get a substantial speedup compared to the naive out-of-place inv(A) just from the fact that you’re using a Cholesky decomposition).

Here’s how you do this through the LAPACK API. Note that this also works for CuArrays because CUDA.jl adds CUSOLVER-wrapping methods to LAPACK.potr{f,s,i}!.

using LinearAlgebra
using LinearAlgebra: BlasReal, BlasFloat

function inv!(A::Union{Symmetric{<:BlasReal},Hermitian{<:BlasFloat}})
    _, info = LAPACK.potrf!(A.uplo,
    (info == 0) || throw(PosDefException(info))
    return A

Note that this function relies on A being <: Union{Symmetric,Hermitian} such that only one triangle of, indicated by A.uplo, is considered data both for input and output. The LAPACK functions will use the other triangle as workspace and leave garbage there upon returning. If you need to contain a valid inverse at the end you need to copy data over from the active triangle. This is what LinearAlgebra.inv!(cholesky!(A)) does for you and is the reason it takes a little more time.

Apart from this, LinearAlgebra.inv!(cholesky!(A)) and inv!(A) are basically equivalent; they make the same LAPACK calls under the hood, and while cholesky! has a type instability and hence a dynamic lookup and two tiny allocations, the performance hit from this is entirely insignificant (about 0.4 μs when I isolate the difference in benchmarks, compared to >100 μs for computing a 100x100 inverse on my laptop).

Benchmarks at 100x100:

julia> using BenchmarkTools

julia> function randposdef(N)
           A = randn(N, N)
           return Symmetric(A * A' + I)
randposdef (generic function with 1 method)

julia> @btime inv(A) setup=(A = randposdef(100)) evals=1;
  177.223 μs (5 allocations: 129.17 KiB)

julia> @btime inv(cholesky!(A)) setup=(A = randposdef(100)) evals=1;
  128.572 μs (4 allocations: 78.22 KiB)

julia> @btime LinearAlgebra.inv!(cholesky!(A)) setup=(A = randposdef(100)) evals=1;
  121.586 μs (2 allocations: 48 bytes)

julia> @btime inv!(A) setup=(A = randposdef(100)) evals=1;
  114.315 μs (0 allocations: 0 bytes)