Non-allocating matrix inversion

Is there any way to perform matrix inversion without allocating?

Say I have a Vector of Matrix whose elements I want to invert, saving each inverse back into the same slot of the array it came from. I had thought of this:

function bucle!(Zd)
    @simd for i in eachindex(Zd)
        @inbounds Zd[i] = inv(Zd[i])
    end
    return Zd
end

But it allocates:

julia> @btime bucle!($Zd)
  1.125 ms (4004 allocations: 2.34 MiB)

Thanks!

For a single matrix this could be:

julia> function inv!(m)
          m .= inv(m)
       end
inv! (generic function with 1 method)

But inv already allocates:

julia> @btime inv($m)
  838.983 ns (4 allocations: 1.95 KiB)

julia> @btime inv!($m)
  812.727 ns (4 allocations: 1.95 KiB)

Your function would then be:

julia> function bucle!(Zd)
           @simd for i in eachindex(Zd)
               @inbounds inv!(Zd[i])
           end
           return Zd
       end

julia> Zd = [rand(Float64, 3, 3), rand(Float64, 3, 3), rand(Float64, 3, 3)];

julia> @btime bucle!($Zd)
  2.511 μs (12 allocations: 5.86 KiB)

If your matrices are small (say, less than about 10×10), then you can use StaticArrays.jl to make your original function faster and non-allocating with no changes at all to the function itself:

julia> using BenchmarkTools

julia> using StaticArrays

julia> function bucle!(Zd)
           @simd for i in eachindex(Zd)
               @inbounds Zd[i] = inv(Zd[i])
           end
           return Zd
       end
bucle! (generic function with 1 method)

julia> Zd = [rand(SMatrix{3, 3, Float64}) for _ in 1:3];

julia> @btime bucle!($Zd);
  25.224 ns (0 allocations: 0 bytes)

On my machine, using SMatrix as the element type of Zd results in a ~50× speedup over plain Matrix for a 3×3 input (expect this improvement to decrease as the matrices get larger, and to go away completely for large enough matrices).

7 Likes

Am I missing something, or is your code doing the element-wise reciprocal and not matrix inversion, which you could do allocation-free with

julia> bucle!(m) = m .= inv.(m)
bucle! (generic function with 1 method)

julia> @btime bucle!(m) setup = (m = collect(reshape(1.0:100.0, 10, 10)))
  368.077 ns (0 allocations: 0 bytes)
10×10 Matrix{Float64}:
 1.0       0.0909091  0.047619   0.0322581  0.0243902  0.0196078  0.0163934  0.0140845  0.0123457  0.010989
 0.5       0.0833333  0.0454545  0.03125    0.0238095  0.0192308  0.016129   0.0138889  0.0121951  0.0108696
 0.333333  0.0769231  0.0434783  0.030303   0.0232558  0.0188679  0.015873   0.0136986  0.0120482  0.0107527
 0.25      0.0714286  0.0416667  0.0294118  0.0227273  0.0185185  0.015625   0.0135135  0.0119048  0.0106383
 0.2       0.0666667  0.04       0.0285714  0.0222222  0.0181818  0.0153846  0.0133333  0.0117647  0.0105263
 0.166667  0.0625     0.0384615  0.0277778  0.0217391  0.0178571  0.0151515  0.0131579  0.0116279  0.0104167
 0.142857  0.0588235  0.037037   0.027027   0.0212766  0.0175439  0.0149254  0.012987   0.0114943  0.0103093
 0.125     0.0555556  0.0357143  0.0263158  0.0208333  0.0172414  0.0147059  0.0128205  0.0113636  0.0102041
 0.111111  0.0526316  0.0344828  0.025641   0.0204082  0.0169492  0.0144928  0.0126582  0.011236   0.010101
 0.1       0.05       0.0333333  0.025      0.02       0.0166667  0.0142857  0.0125     0.0111111  0.01

?

Numerically speaking, you almost never actually want to compute a matrix inverse. The main thing an inverse is good for is solving Ax=b for x, which can be done more efficiently and stably via a factorization such as LU (Gaussian elimination) or QR (and, in specific cases, by even better specialized techniques).

The inverse is only really useful for symbolic reasoning.

If in doubt, do x = A\b rather than x = inv(A)*b
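A minimal sketch of the difference, using a hypothetical 2×2 system (both give the same answer up to rounding, but `\` never forms the inverse):

```julia
using LinearAlgebra

A = [2.0 1.0; 1.0 3.0]
b = [1.0, 2.0]

x1 = A \ b        # solves via LU factorization; no explicit inverse is formed
x2 = inv(A) * b   # forms the full inverse first: slower and less stable

@assert x1 ≈ x2   # same answer here, but x1 is the numerically preferred route
```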

3 Likes

Expanding on that a bit: if you want to solve small linear systems, use StaticArrays.jl. If you want to solve bigger linear systems, look into factorize (or the in-place lu!). Otherwise, it would be good to know more about your application in order to help you.
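A sketch of that workflow, assuming the common case of one matrix and many right-hand sides: factorize once, then reuse the factorization for every solve.

```julia
using LinearAlgebra

A = rand(100, 100) + 100I   # diagonally dominant, so comfortably well-conditioned
F = lu(A)                   # factor once (lu! would additionally reuse A's storage)

for _ in 1:5
    b = rand(100)
    x = F \ b               # each solve reuses the same LU factors
    @assert A * x ≈ b
end
```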

1 Like

There is actually an undocumented LinearAlgebra.inv! function that acts in-place on an LU factorization, which can be constructed in-place with lu!:

using LinearAlgebra

inv!(A) = LinearAlgebra.inv!(lu!(A))

However, it still allocates some small 1d arrays internally:

julia> A = rand(1000,1000);

julia> @btime inv($A);
  30.062 ms (5 allocations: 8.13 MiB)

julia> @btime inv!($A);
  28.390 ms (3 allocations: 508.09 KiB)

To completely eliminate all allocations, you need to pre-allocate the low-level workspace vectors and call lower-level LAPACK functions (based on code for getrf! and getri!).

import Base: require_one_based_indexing
import LinearAlgebra: checksquare, chkstride1
import LinearAlgebra.BLAS: @blasfunc, BlasInt
import LinearAlgebra.LAPACK: chkargsok, chklapackerror, liblapack, getrf!, getri!

for (getrf, getri, elty) in ((:dgetrf_,:dgetri_,:Float64), (:sgetrf_,:sgetri_,:Float32), (:zgetrf_,:zgetri_,:ComplexF64), (:cgetrf_,:cgetri_,:ComplexF32))
    @eval function getrf!(A::AbstractMatrix{$elty}, ipiv::AbstractVector{BlasInt})
        require_one_based_indexing(A)
        chkstride1(A)
        chkstride1(ipiv)
        m, n = size(A)
        lda  = max(1,stride(A, 2))
        length(ipiv) ≥ min(m,n) || throw(DimensionMismatch())
        info = Ref{BlasInt}()
        ccall((@blasfunc($getrf), liblapack), Cvoid,
              (Ref{BlasInt}, Ref{BlasInt}, Ptr{$elty},
               Ref{BlasInt}, Ptr{BlasInt}, Ptr{BlasInt}),
              m, n, A, lda, ipiv, info)
        chkargsok(info[])
        A, ipiv, info[] #Error code is stored in LU factorization type
    end
    @eval function getri!(A::AbstractMatrix{$elty}, ipiv::AbstractVector{BlasInt}, work::AbstractVector{$elty})
        require_one_based_indexing(A, ipiv)
        chkstride1(A, ipiv)
        n = checksquare(A)
        if n != length(ipiv)
            throw(DimensionMismatch("ipiv has length $(length(ipiv)), but needs $n"))
        end
        lda = max(1,stride(A, 2))
        lwork = BlasInt(length(work))
        lwork ≥ n || throw(DimensionMismatch())
        info  = Ref{BlasInt}()
        ccall((@blasfunc($getri), liblapack), Cvoid,
              (Ref{BlasInt}, Ptr{$elty}, Ref{BlasInt}, Ptr{BlasInt},
               Ptr{$elty}, Ref{BlasInt}, Ptr{BlasInt}),
              n, A, lda, ipiv, work, lwork, info)
        chklapackerror(info[])
        A
    end
end

struct InvWorkspace{T}
    ipiv::Vector{BlasInt}
    work::Vector{T}
end

function InvWorkspace(A::AbstractMatrix{T}) where {T}
    n = checksquare(A)
    ipiv = Vector{BlasInt}(undef, n)
    work = Vector{T}(undef, 64*n)
    return InvWorkspace{T}(ipiv, work)
end

function inv!(A::AbstractMatrix{T}, W::InvWorkspace{T}) where {T}
    getrf!(A, W.ipiv)
    return getri!(A, W.ipiv, W.work)
end

which gives

julia> W = InvWorkspace(A);

julia> @btime inv!($A, $W);
  28.423 ms (0 allocations: 0 bytes)
5 Likes

Here is a quick benchmark comparing the alternatives:

using LinearAlgebra, BenchmarkTools, BenchmarkPlots, StatsPlots
bg = BenchmarkGroup()
bg[:inv] = @benchmarkable A*inv(B)*b setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:backslash] = @benchmarkable A*(B\b) setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:slash] = @benchmarkable (A/B)*b setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:qr] = @benchmarkable A*(qr(B)\b) setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:lu] = @benchmarkable A*(lu(B)\b) setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:inv_pre] = @benchmarkable A*Q*b setup=(A=rand(100,100);B=rand(100,100);Q=inv(B);b=rand(100))
bg[:qr_pre] = @benchmarkable A*(Q\b) setup=(A=rand(100,100);B=rand(100,100);Q=qr(B);b=rand(100))
bg[:lu_pre] = @benchmarkable A*(Q\b) setup=(A=rand(100,100);B=rand(100,100);Q=lu(B);b=rand(100))
trial = run(bg)
plot(trial; yscale=:log10)

(benchmark plot: log-scale timings for each of the methods above)

Even ignoring the numerical instability of the inversion process, you can see that A*(B\b) is faster than inversion, and that qr(B) is faster when used as a precomputed auxiliary.

Edit: Added in LU decomposition, because it’s the preferred algorithm.

4 Likes

Yes, the standard caveat applies that you are normally better off not using matrix inverses at all, but rather computing the factorization object (typically LU factorization for solving linear systems).

1 Like

Yep, I think that using StaticArrays for my problem will be the wisest choice. All of my matrices are 2×2 or 4×4.

But I have a question: why do you use SMatrix instead of MMatrix? Shouldn't it be mutable, since you are reusing the storage?

Thanks for the insight. I am aware of it, but unfortunately there is no way around it here: I sometimes need to plot each element of the inverse matrix.

There is no multiplication afterwards, so I get no benefit from that. :wink:

Whoa, this is deep level stuff haha. Thanks for the info!

You don’t need to overwrite individual elements of the matrix — you are overwriting the whole matrix with its inverse. So just SMatrix is fine. (The temporary intermediate calculations are done in registers or on the stack, so no need to worry about allocations there.)
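A minimal sketch of this point, with a hypothetical fixed 2×2 value: assigning into the Vector rebinds the slot to a brand-new immutable SMatrix, so nothing ever needs to be mutated element-by-element.

```julia
using StaticArrays

Zd = [@SMatrix([2.0 1.0; 1.0 3.0]) for _ in 1:3]
A1 = Zd[1]
Zd[1] = inv(Zd[1])   # rebinds the slot; the old SMatrix value is simply discarded

@assert Zd[1] * A1 ≈ one(SMatrix{2,2,Float64})   # the slot now holds the inverse
```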

I think that using StaticArrays for my problem will be the wisest. All of my matrices are 2x2 or 4x4.

Yes, in this case they will be an order of magnitude (at least!) faster than calling out to LAPACK.

4 Likes