Perhaps it’s worth adding to this thread that if your matrix is symmetric/hermitian positive definite, you can compute an in-place inverse with zero (pre-)allocations through a Cholesky decomposition. The undocumented LinearAlgebra.inv!(cholesky!(A))
gets you most of the way there, but to get every last ounce of performance you need to use the LAPACK API as shown below (in both cases you get a substantial speedup compared to the naive out-of-place inv(A)
just from the fact that you’re using a Cholesky decomposition).
Here’s how you do this through the LAPACK API. Note that this also works for CuArray
s because CUDA.jl
adds CUSOLVER
-wrapping methods to LAPACK.potr{f,s,i}!
.
using LinearAlgebra
using LinearAlgebra: BlasReal, BlasFloat
function inv!(A::Union{Symmetric{<:BlasReal},Hermitian{<:BlasFloat}})
_, info = LAPACK.potrf!(A.uplo, A.data)
(info == 0) || throw(PosDefException(info))
LAPACK.potri!(A.uplo, A.data)
return A
end
Note that this function relies on A
being <: Union{Symmetric,Hermitian}
such that only one triangle of A.data
, indicated by A.uplo
, is considered data both for input and output. The LAPACK functions will use the other triangle as workspace and leave garbage there upon returning. If you need A.data
to contain a valid inverse at the end you need to copy data over from the active triangle. This is what LinearAlgebra.inv!(cholesky!(A))
does for you and is the reason it takes a little more time.
Apart from this, LinearAlgebra.inv!(cholesky!(A))
and inv!(A)
are basically equivalent; they make the same LAPACK calls under the hood, and while cholesky!
has a type instability and hence a dynamic lookup and two tiny allocations, the performance hit from this is entirely insignificant (about 0.4 μs when I isolate the difference in benchmarks, compared to >100 μs for computing a 100x100 inverse on my laptop).
Benchmarks at 100x100:
julia> using BenchmarkTools
julia> function randposdef(N)
A = randn(N, N)
return Symmetric(A * A' + I)
end
randposdef (generic function with 1 method)
julia> @btime inv(A) setup=(A = randposdef(100)) evals=1;
177.223 μs (5 allocations: 129.17 KiB)
julia> @btime inv(cholesky!(A)) setup=(A = randposdef(100)) evals=1;
128.572 μs (4 allocations: 78.22 KiB)
julia> @btime LinearAlgebra.inv!(cholesky!(A)) setup=(A = randposdef(100)) evals=1;
121.586 μs (2 allocations: 48 bytes)
julia> @btime inv!(A) setup=(A = randposdef(100)) evals=1;
114.315 μs (0 allocations: 0 bytes)