# Non-allocating matrix inversion

Is there any way to perform matrix inversion in place, without allocating?

Say I have a `Vector` of `Matrix` whose elements I want to invert, storing each inverse back in the same slot of the array it came from. I had thought of this:

``````
function bucle!(Zd)
    @simd for i in eachindex(Zd)
        @inbounds Zd[i] = inv(Zd[i])
    end
    return Zd
end
``````

But it allocates:

``````
julia> @btime bucle!($Zd)
1.125 ms (4004 allocations: 2.34 MiB)
``````

Thanks!

For a single matrix this could be:

``````
julia> function inv!(m)
           m .= inv(m)
       end
inv! (generic function with 1 method)
``````

But `inv` already allocates:

``````
julia> @btime inv($m)
838.983 ns (4 allocations: 1.95 KiB)

julia> @btime inv!($m)
812.727 ns (4 allocations: 1.95 KiB)
``````

``````
julia> function bucle!(Zd)
           @simd for i in eachindex(Zd)
               @inbounds inv!(Zd[i])
           end
           return Zd
       end

julia> Zd = [rand(Float64, 3, 3), rand(Float64, 3, 3), rand(Float64, 3, 3)];

julia> @btime bucle!($Zd)
2.511 μs (12 allocations: 5.86 KiB)
``````

If your matrices are small (say, less than about 10x10), then you can use https://github.com/JuliaArrays/StaticArrays.jl to make your original function faster and non-allocating, with no changes at all to the function itself:

``````
julia> using BenchmarkTools

julia> using StaticArrays

julia> function bucle!(Zd)
           @simd for i in eachindex(Zd)
               @inbounds Zd[i] = inv(Zd[i])
           end
           return Zd
       end
bucle! (generic function with 1 method)

julia> Zd = [rand(SMatrix{3, 3, Float64}) for _ in 1:3];

julia> @btime bucle!($Zd);
25.224 ns (0 allocations: 0 bytes)
``````

On my machine, using `SMatrix` as the element type of `Zd` gives a 50x speedup over plain `Matrix` for 3x3 inputs (expect this improvement to shrink as the matrices get larger, and to disappear entirely for large enough matrices).


Am I missing something, or is your code computing the element-wise reciprocal rather than a matrix inversion? If so, you could do that allocation-free with:

``````
julia> bucle!(m) = m .= inv.(m)
bucle! (generic function with 1 method)

julia> @btime bucle!(m) setup = (m = collect(reshape(1.0:100.0, 10, 10)))
368.077 ns (0 allocations: 0 bytes)
10×10 Matrix{Float64}:
1.0       0.0909091  0.047619   0.0322581  0.0243902  0.0196078  0.0163934  0.0140845  0.0123457  0.010989
0.5       0.0833333  0.0454545  0.03125    0.0238095  0.0192308  0.016129   0.0138889  0.0121951  0.0108696
0.333333  0.0769231  0.0434783  0.030303   0.0232558  0.0188679  0.015873   0.0136986  0.0120482  0.0107527
0.25      0.0714286  0.0416667  0.0294118  0.0227273  0.0185185  0.015625   0.0135135  0.0119048  0.0106383
0.2       0.0666667  0.04       0.0285714  0.0222222  0.0181818  0.0153846  0.0133333  0.0117647  0.0105263
0.166667  0.0625     0.0384615  0.0277778  0.0217391  0.0178571  0.0151515  0.0131579  0.0116279  0.0104167
0.142857  0.0588235  0.037037   0.027027   0.0212766  0.0175439  0.0149254  0.012987   0.0114943  0.0103093
0.125     0.0555556  0.0357143  0.0263158  0.0208333  0.0172414  0.0147059  0.0128205  0.0113636  0.0102041
0.111111  0.0526316  0.0344828  0.025641   0.0204082  0.0169492  0.0144928  0.0126582  0.011236   0.010101
0.1       0.05       0.0333333  0.025      0.02       0.0166667  0.0142857  0.0125     0.0111111  0.01
``````


Numerically speaking, you almost never actually want to compute a matrix inverse. The main thing an inverse is good for is solving `Ax = b` for `x`, which can be done more efficiently and stably via a factorization such as LU (Gaussian elimination) or QR (and, in specific cases, by even better specialized techniques).

The inverse is only really useful for symbolic reasoning.

If in doubt, do `x = A\b` rather than `x = inv(A)*b`
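For instance, here is a quick sketch (using nothing beyond stock `LinearAlgebra`; the names `x1` and `x2` are just for illustration) showing that the two approaches agree up to floating-point roundoff, while `\` avoids ever forming the inverse:

``````
using LinearAlgebra

A = rand(100, 100); b = rand(100)
x1 = A \ b        # solves via an LU factorization with pivoting
x2 = inv(A) * b   # forms the explicit inverse first: slower and less accurate
x1 ≈ x2           # typically true, but x1 tends to have the smaller residual norm(A*x - b)
``````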


Expanding that a bit: if you want to solve small linear systems, use StaticArrays.jl. If you want to solve bigger linear systems, look at `factorize` and the in-place factorizations such as `lu!`. Otherwise, it would help to know more about your application.
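A minimal sketch of what that looks like for bigger systems (the names `F`, `b1`, `b2` are just for illustration): compute the factorization once, in place, and reuse it for every right-hand side:

``````
using LinearAlgebra

A = rand(1000, 1000)
F = lu!(A)        # in-place LU factorization, reusing A's storage
b1, b2 = rand(1000), rand(1000)
x1 = F \ b1       # each solve reuses the factorization
x2 = F \ b2       # no refactorization needed
``````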


There is actually an undocumented `LinearAlgebra.inv!` function that acts in-place on an LU factorization, which can be constructed in-place with `lu!`:

``````
# Factorize in place with lu!, then invert the LU factorization in place:
inv!(A) = LinearAlgebra.inv!(lu!(A))
``````

However, it still allocates some small 1d arrays internally:

``````
julia> A = rand(1000,1000);

julia> @btime inv($A);
30.062 ms (5 allocations: 8.13 MiB)

julia> @btime inv!($A);
28.390 ms (3 allocations: 508.09 KiB)
``````

To completely eliminate allocations, you need to pre-allocate the low-level workspace arrays and call the lower-level LAPACK functions directly (the following is based on the Julia code for `getrf!` and `getri!`):

``````
import Base: require_one_based_indexing
import LinearAlgebra: checksquare, chkstride1
import LinearAlgebra.BLAS: @blasfunc, BlasInt
import LinearAlgebra.LAPACK: chkargsok, chklapackerror, liblapack, getrf!, getri!

for (getrf, getri, elty) in ((:dgetrf_, :dgetri_, :Float64), (:sgetrf_, :sgetri_, :Float32),
                             (:zgetrf_, :zgetri_, :ComplexF64), (:cgetrf_, :cgetri_, :ComplexF32))
    @eval function getrf!(A::AbstractMatrix{$elty}, ipiv::AbstractVector{BlasInt})
        require_one_based_indexing(A)
        chkstride1(A)
        chkstride1(ipiv)
        m, n = size(A)
        lda = max(1, stride(A, 2))
        length(ipiv) ≥ min(m, n) || throw(DimensionMismatch())
        info = Ref{BlasInt}()
        ccall((@blasfunc($getrf), liblapack), Cvoid,
              (Ref{BlasInt}, Ref{BlasInt}, Ptr{$elty},
               Ref{BlasInt}, Ptr{BlasInt}, Ptr{BlasInt}),
              m, n, A, lda, ipiv, info)
        chkargsok(info[])
        A, ipiv, info[] # error code is stored in the LU factorization type
    end
    @eval function getri!(A::AbstractMatrix{$elty}, ipiv::AbstractVector{BlasInt}, work::AbstractVector{$elty})
        require_one_based_indexing(A, ipiv)
        chkstride1(A, ipiv)
        n = checksquare(A)
        if n != length(ipiv)
            throw(DimensionMismatch("ipiv has length $(length(ipiv)), but needs $n"))
        end
        lda = max(1, stride(A, 2))
        lwork = BlasInt(length(work))
        lwork ≥ n || throw(DimensionMismatch())
        info = Ref{BlasInt}()
        ccall((@blasfunc($getri), liblapack), Cvoid,
              (Ref{BlasInt}, Ptr{$elty}, Ref{BlasInt}, Ptr{BlasInt},
               Ptr{$elty}, Ref{BlasInt}, Ptr{BlasInt}),
              n, A, lda, ipiv, work, lwork, info)
        chklapackerror(info[])
        A
    end
end

struct InvWorkspace{T}
    ipiv::Vector{BlasInt}
    work::Vector{T}
end

function InvWorkspace(A::AbstractMatrix{T}) where {T}
    n = checksquare(A)
    ipiv = Vector{BlasInt}(undef, n)
    work = Vector{T}(undef, 64*n)
    return InvWorkspace{T}(ipiv, work)
end

function inv!(A::AbstractMatrix{T}, W::InvWorkspace{T}) where {T}
    getrf!(A, W.ipiv)
    return getri!(A, W.ipiv, W.work)
end
``````

which gives

``````
julia> W = InvWorkspace(A);

julia> @btime inv!($A, $W);
28.423 ms (0 allocations: 0 bytes)
``````
Here's a quick benchmark comparing several ways of computing `A*inv(B)*b`:

``````
using LinearAlgebra, BenchmarkTools, BenchmarkPlots, StatsPlots
bg = BenchmarkGroup()
bg[:inv] = @benchmarkable A*inv(B)*b setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:backslash] = @benchmarkable A*(B\b) setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:slash] = @benchmarkable (A/B)*b setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:qr] = @benchmarkable A*(qr(B)\b) setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:lu] = @benchmarkable A*(lu(B)\b) setup=(A=rand(100,100);B=rand(100,100);b=rand(100))
bg[:inv_pre] = @benchmarkable A*Q*b setup=(A=rand(100,100);B=rand(100,100);Q=inv(B);b=rand(100))
bg[:qr_pre] = @benchmarkable A*(Q\b) setup=(A=rand(100,100);B=rand(100,100);Q=qr(B);b=rand(100))
bg[:lu_pre] = @benchmarkable A*(Q\b) setup=(A=rand(100,100);B=rand(100,100);Q=lu(B);b=rand(100))
trial = run(bg)
plot(trial; yscale=:log10)
``````

Even ignoring the numerical instability of the inversion process, you can see that `A*(B\b)` is faster than inversion, and that `qr(B)` is faster when used as a precomputed factorization.

Edit: added `lu`, because LU is the preferred algorithm here.


Yes, the standard caveat applies that you are normally better off not using matrix inverses at all, but rather computing the factorization object (typically LU factorization for solving linear systems).


Yep, I think that using `StaticArrays` for my problem will be the wisest choice. All of my matrices are 2x2 or 4x4.

But I have a question: why do you use `SMatrix` instead of `MMatrix`? Shouldn't the matrix need to be mutable, since you are reusing its storage?

Thanks for the insight. I am aware of it, but unfortunately there is no way around it here: I sometimes need to plot each element of the inverse matrix.

There is no multiplication afterwards, so I get no benefit from that.

Whoa, this is deep level stuff haha. Thanks for the info!

You don’t need to overwrite individual elements of the matrix — you are overwriting the whole matrix with its inverse. So just `SMatrix` is fine. (The temporary intermediate calculations are done in registers or on the stack, so no need to worry about allocations there.)
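Something like this illustrates the point (a minimal sketch; `Zd` as in the earlier posts):

``````
using StaticArrays

Zd = [rand(SMatrix{3,3,Float64}) for _ in 1:3]
Zd[1] = inv(Zd[1])  # rebinds the slot to a brand-new immutable SMatrix; nothing is mutated in place
``````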

> I think that using StaticArrays for my problem will be the wisest choice. All of my matrices are 2x2 or 4x4.

Yes, in this case they will be an order of magnitude (at least!) faster than calling out to LAPACK.


Perhaps it’s worth adding to this thread that if your matrix is symmetric/hermitian positive definite, you can compute an in-place inverse with zero (pre-)allocations through a Cholesky decomposition. The undocumented `LinearAlgebra.inv!(cholesky!(A))` gets you most of the way there, but to get every last ounce of performance you need to use the LAPACK API as shown below (in both cases you get a substantial speedup compared to the naive out-of-place `inv(A)` just from the fact that you’re using a Cholesky decomposition).

Here’s how you do this through the LAPACK API. Note that this also works for `CuArray`s because `CUDA.jl` adds `CUSOLVER`-wrapping methods to `LAPACK.potr{f,s,i}!`.

``````
using LinearAlgebra
using LinearAlgebra: BlasReal, BlasFloat

function inv!(A::Union{Symmetric{<:BlasReal},Hermitian{<:BlasFloat}})
    _, info = LAPACK.potrf!(A.uplo, A.data)
    (info == 0) || throw(PosDefException(info))
    LAPACK.potri!(A.uplo, A.data)
    return A
end
``````

Note that this function relies on `A` being `<: Union{Symmetric,Hermitian}` such that only one triangle of `A.data`, indicated by `A.uplo`, is considered data both for input and output. The LAPACK functions will use the other triangle as workspace and leave garbage there upon returning. If you need `A.data` to contain a valid inverse at the end you need to copy data over from the active triangle. This is what `LinearAlgebra.inv!(cholesky!(A))` does for you and is the reason it takes a little more time.
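If you do need the full matrix at the end, one way to do that copy (a sketch, assuming `A.data` is a dense `Matrix` as above) is with `LinearAlgebra.copytri!`:

``````
# After inv!(A), mirror the computed triangle into the other one so that
# A.data holds the full inverse (conjugate-transposing in the Hermitian case):
LinearAlgebra.copytri!(A.data, A.uplo, A isa Hermitian)
``````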

Apart from this, `LinearAlgebra.inv!(cholesky!(A))` and `inv!(A)` are basically equivalent; they make the same LAPACK calls under the hood, and while `cholesky!` has a type instability and hence a dynamic lookup and two tiny allocations, the performance hit from this is entirely insignificant (about 0.4 μs when I isolate the difference in benchmarks, compared to >100 μs for computing a 100x100 inverse on my laptop).

Benchmarks at 100x100:

``````
julia> using BenchmarkTools

julia> function randposdef(N)
           A = randn(N, N)
           return Symmetric(A * A' + I)
       end
randposdef (generic function with 1 method)

julia> @btime inv(A) setup=(A = randposdef(100)) evals=1;
177.223 μs (5 allocations: 129.17 KiB)

julia> @btime inv(cholesky!(A)) setup=(A = randposdef(100)) evals=1;
128.572 μs (4 allocations: 78.22 KiB)

julia> @btime LinearAlgebra.inv!(cholesky!(A)) setup=(A = randposdef(100)) evals=1;
121.586 μs (2 allocations: 48 bytes)

julia> @btime inv!(A) setup=(A = randposdef(100)) evals=1;
114.315 μs (0 allocations: 0 bytes)
``````