Pseudoinverse vs. factorization in a repeated linear solve: when to use which?

Assume I want to solve Ax = b for many different right-hand sides b while keeping A fixed, and assume A is a tall matrix. What are the pros and cons of precomputing the pseudoinverse of A versus precomputing its QR factorization?

The documentation of LinearSolve.jl, for example, discusses the problem of repeated solves, but the pseudoinverse does not appear to be used there. The book Fundamentals of Numerical Computation also says that the pseudoinverse “is rarely necessary in practice”.

Nonetheless, in the following benchmark, I appear to get faster and nonallocating results with the pseudoinverse approach:

using LinearAlgebra, BenchmarkTools

A = rand(Float32, 10^4, 10^2)   # tall matrix
invA = pinv(A)                  # precomputed pseudoinverse
b = rand(Float32, 10^4)

F = qr(A)                       # precomputed QR factorization

dest1 = Vector{Float32}(undef, 10^2)
dest2 = Vector{Float32}(undef, 10^2)

ldiv!(dest1, F, b)              # solve with the factorization
mul!(dest2, invA, b)            # solve by multiplying with the pseudoinverse

dest1 ≈ dest2 # true

@benchmark ldiv!($dest1, $F, $b)
@benchmark mul!($dest2, $invA, $b)

The output of the benchmark of ldiv! is

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  277.379 μs …  30.607 ms  ┊ GC (min … max): 0.00% … 98.50%
 Time  (median):     432.062 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   394.850 μs ± 311.694 μs  ┊ GC (mean ± σ):  0.87% ±  1.54%

  ▁    ▆                                      ▇█▅▁
  █▅▃▂██▅▃▃▂▂▁▁▁▂▁▁▁▁▁▂▄▂▂▂▂▃▃▂▂▂▂▂▂▂▂▃▄▄▃▃▂▃▇████▆▄▄▃▃▃▃▃▃▂▂▂▂ ▃
  277 μs           Histogram: frequency by time          490 μs <

 Memory estimate: 39.62 KiB, allocs estimate: 4.

while that of mul! is

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  82.456 μs … 259.985 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.826 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.672 μs ±   5.320 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▁▂▄▃▄▄▃▄▆▆▇█▇█▅▄▃▁
  ▂▂▁▁▂▂▃▃▄▆▇██████████████████▇▆▆▅▅▅▄▃▃▄▄▄▃▃▃▄▃▃▃▃▄▃▄▃▃▃▃▃▃▂▂ ▅
  82.5 μs         Histogram: frequency by time          110 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This result seems to contradict the general advice, at least as given in the references I found. Am I missing something here?

Thanks in advance for any help!

If you have enough right-hand sides, computing and multiplying by the pseudoinverse will be faster. However, computing the pseudoinverse is slower than computing F, so you should also benchmark the setup cost of F versus invA and think about how many right-hand sides you actually have: you might or might not have enough for the pseudoinverse to win overall. For only a few right-hand sides, the factorization will be faster.
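A minimal sketch of timing that setup step (same sizes as in the original post):

using LinearAlgebra, BenchmarkTools

A = rand(Float32, 10^4, 10^2)   # same shape as in the original post

@benchmark qr($A)     # one-time cost of the factorization
@benchmark pinv($A)   # one-time cost of the pseudoinverse (computed via an SVD)

Dividing the difference in setup times by the per-solve savings gives a rough estimate of how many right-hand sides you need before the pseudoinverse comes out ahead.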

Edit: As pointed out by @stevengj, I had forgotten pinv would apply a threshold here. So this example does not work. Since it became part of the discussion, I’ll leave it in as context for other posts and give another example at the bottom.

Also, computing the pseudoinverse is not backward stable. This will have a notable impact on accuracy relative to the factorization approach if A is ill conditioned and the residual is small. Consider

using LinearAlgebra

# Construct an ill-conditioned 10^4 × 10^2 matrix via rapidly decaying "singular values"
A =
  randn(Float32, 10^4, 10^2) *
  Diagonal((x -> Float32(exp(-.08 * x))).(1:(10^2))) *
  randn(Float32, 10^2, 10^2)
@show cond(A)
x = ones(Float32, 10^2)
invA = pinv(A)
b = A * x + 1f-5 * randn(Float32, 10^4)   # consistent right-hand side plus a small error
@show norm(b - A*x)

F = qr(A)

dest1 = Vector{Float32}(undef, 10^2)
dest2 = Vector{Float32}(undef, 10^2)

ldiv!(dest1, F, b)      # QR-based least-squares solve
mul!(dest2, invA, b)    # pseudoinverse multiply

@show norm(x - dest1)
@show norm(x - dest2)

This attempts to recover a known x (all ones) from b = Ax + e for a small error vector e. A somewhat typical run gave:

cond(A) = 169762.02f0
norm(b - A * x) = 0.0010010158f0
norm(x - dest1) = 0.0022723246f0
norm(x - dest2) = 0.6865536f0

So the forward error of the pseudoinverse solution is more than two orders of magnitude larger. If you know your problem is well conditioned, you won’t need to worry about this.

New example:

using LinearAlgebra
m = 10^3
n = 10^2
# Ill-conditioned m × n matrix via rapidly decaying "singular values"
A =
  randn(m, n) *
  Diagonal((x -> exp(-.2 * x)).(1:n)) *
  randn(n, n)
@show cond(A)
x = ones(n)
invA = pinv(A, rtol=0)   # rtol=0 disables the default truncation of small singular values
b = A * x                # consistent system: the true residual is zero

F = qr(A)

x1 = F \ b               # QR factorization
x2 = invA * b            # pseudoinverse
x3 = (A'*A) \ A'*b       # normal equations

ldiv!(x1, F, b)
mul!(x2, invA, b)

@show norm(x - x1)/norm(x)
@show norm(x - x2)/norm(x)
@show norm(x - x3)/norm(x)

r1 = A*x1-b
r2 = A*x2-b
r3 = A*x3-b

# Normalized measure of each residual's orthogonality to range(A)
@show norm(A'*r1)/opnorm(A)^2/norm(x1)
@show norm(A'*r2)/opnorm(A)^2/norm(x2)
@show norm(A'*r3)/opnorm(A)^2/norm(x3)

with output

cond(A) = 3.0096217569712498e10
norm(x - x1) / norm(x) = 2.6298853469329983e-7
norm(x - x2) / norm(x) = 1.1859103049127409e-7
norm(x - x3) / norm(x) = 2.264841964440257
(norm(A' * r1) / opnorm(A) ^ 2) / norm(x1) = 5.231508417485662e-17
(norm(A' * r2) / opnorm(A) ^ 2) / norm(x2) = 1.4412504384824499e-8
(norm(A' * r3) / opnorm(A) ^ 2) / norm(x3) = 3.5726030429737984e-10

I made it a consistent system, switched to double precision, and included the normal equations. For forward accuracy, factorization and the pseudoinverse are comparable and about what you might hope to get for a consistent problem with this condition number. The normal equations suffer from condition squaring and give much worse accuracy. The residuals do show a difference between factorization and inversion: the residual from the factorization is much better in the sense of being numerically orthogonal to \mbox{range}(A).

So the situation is analogous to the square nonsingular case where forward accuracy is comparable between factorization and inversion, but backward stable factorization gives better residuals.

If you’re going to use QR for this, I would use qr(A, ColumnNorm()) to get a column-pivoted QR, in case A happens to be badly conditioned (nearly dependent columns). That’s what A \ b uses by default.
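With the precomputed-factorization pattern from the original post, that looks something like this (a sketch; the pivoted factorization is reusable across right-hand sides):

F = qr(A, ColumnNorm())   # column-pivoted QR, computed once
x = F \ b                 # least-squares solution, equivalent to A \ b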

Really? Trefethen & Bau’s Numerical Linear Algebra says the following for the solution of least-squares problems via the SVD:

[images: excerpt from Trefethen & Bau on least squares via the SVD]

For full-rank matrices, this only differs from pinv(A, rtol=0) * b in the placement of the parentheses (note that pinv is computed via an SVD in Julia, essentially V*(S\U')), but does that kill the stability? (Maybe? I’m having trouble finding a reference that explicitly discusses this.)
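Spelled out with the A and b from the example above, the two orderings I mean are roughly:

F = svd(A)                                      # thin SVD: A = U * Diagonal(S) * V'
x_factors = F.V * (Diagonal(F.S) \ (F.U' * b))  # apply the factors to b one at a time
x_pinv    = pinv(A, rtol=0) * b                 # form the explicit pseudoinverse first, then multiply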

Be careful in comparing the forward errors: pinv by default drops singular values smaller than \sigma_1 n \varepsilon, which is effectively a regularization. With your same example, I get:

julia> cond(A)
126379.48f0

julia> norm(Float64.(x) - Float64.(A) \ Float64.(b)) # "exact" double-precision 
0.002385886531237574

julia> norm(x - qr(A) \ b) # QR without column pivoting
0.002170066f0

julia> norm(x - qr(A, ColumnNorm()) \ b) # QR with column pivoting
0.002238404f0

julia> norm(x - pinv(A) * b) # pinv with default drop tolerance
0.29329997f0

julia> norm(x - pinv(A, rtol=0) * b) # pinv without dropping small singular values
0.0028052023f0

julia> norm(x - svd(A) \ b) # using the SVD directly without computing pinv
0.0026305662f0

Of course, if your matrix is really this badly conditioned, you should think about whether you might want to impose some kind of regularization: either a truncated SVD (similar to pinv with a suitable drop tolerance), Tikhonov/ridge regularization, or similar ideas.
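For reference, a minimal sketch of those two options, reusing the A and b from above (the drop tolerance and λ are purely illustrative placeholders, not recommendations):

using LinearAlgebra

n = size(A, 2)

x_tsvd  = pinv(A, rtol=1f-3) * b   # truncated SVD via a drop tolerance

λ = 1f-3                           # Tikhonov/ridge parameter (problem-dependent)
x_ridge = [A; sqrt(λ) * Matrix{Float32}(I, n, n)] \ [b; zeros(Float32, n)]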

Yes. I believe that as far as it goes, that specific statement is correct. But you are right that I was neglecting the truncation in my example. I think it is harder to show backward instability impacting forward errors with a simple example than I thought. That’s a little embarrassing.

The fact that explicit (pseudo)inverses are not backward stable is something that applies even in the special case of a square invertible matrix. There’s a nice discussion of this on page 260 of Higham’s Accuracy and Stability of Numerical Algorithms (2nd edition). I don’t have a nice pdf or scan to include, but a typed-in quote from it is: "Suppose X = A^{-1} is formed exactly, and that the only rounding errors are in forming x = \mbox{fl}(Xb). Then \hat{x} = (X+\Delta X)b where |\Delta X| \leq \gamma_n |X|, by (3.11). So A\hat{x} = A(X+\Delta X)b = (I+A\Delta X) b and the best possible residual bound is

|b-A\hat{x}| \leq \gamma_n |A| |A^{-1}| |b|.

Since…" Here \gamma_n = nu /(1-nu) where u is the unit roundoff.

It goes on to note that that residual bound is not consistent with backward stability if \|x\|_\infty \ll \||A^{-1}| |b|\|_\infty. So even getting an exact inverse and multiplying by b with numerical errors is not going to be backward stable. This is the source of potentially large residuals in ill-conditioned square systems solved using an inverse.
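A quick way to see this numerically in the square case (fresh variable names, deliberately ill conditioned; a rough sketch, not a careful experiment):

using LinearAlgebra

n = 100
B = randn(n, n) * Diagonal(exp.(-0.3 .* (1:n))) * randn(n, n)   # badly conditioned square matrix
y = ones(n)
c = B * y                # right-hand side with a moderate-norm solution

x_lu  = lu(B) \ c        # backward-stable factorization solve
x_inv = inv(B) * c       # explicit inverse

@show norm(c - B * x_lu)  / (opnorm(B) * norm(x_lu))    # tiny, consistent with backward stability
@show norm(c - B * x_inv) / (opnorm(B) * norm(x_inv))   # typically much larger here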

For the section from Trefethen and Bau that you quoted, moving the parentheses results in an explicit computation of an inverse, and this sort of argument applies in the square invertible case. Even forming V*(S\U') exactly isn’t good enough for backward stability once rounding in the matrix-vector multiply (represented by \Delta X) is taken into account.

A square system is obviously a simpler special case of the pseudoinverse. For a full rank (but ill conditioned) least squares problem I would have thought that you would be able to observe condition squaring without too much effort. That’s what I was trying for in that example.

My reasoning was that the conditioning for a LS problem involves a term proportional to both \kappa^2(A) and \|r\|. So the condition squaring does not impact small residual problems so much if you use a backward stable algorithm. This is the forward accuracy advantage of factorization over the normal equations. The pseudoinverse would not be computed using the normal equations, but the columns of the pseudoinverse are solutions to the LS problem with b=e_k. Even if the given problem has a small residual that damps out condition squaring, the columns of the pseudoinverse will not necessarily correspond to problems with a small residual. So I would expect errors proportional to \kappa^2(A)u to be visible in a solution formed from a linear combination of the columns of the pseudoinverse without any mitigation due to a small residual. In a backward stable solution those error terms would be multiplied by \|r\|. Since I forgot about the automatic truncation, that’s what I thought I was observing. I didn’t look at the results as closely as I should have once they seemed to confirm my expectations.
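(As a quick sanity check of the claim that the columns of the pseudoinverse are least-squares solutions for b = e_k, with made-up small dimensions and fresh names:)

using LinearAlgebra

M = randn(200, 20)             # small, well-conditioned test matrix
Mplus = pinv(M)

k = 3                          # arbitrary column index
e_k = zeros(200); e_k[k] = 1.0

# The k-th column of pinv(M) is the least-squares solution of M*y ≈ e_k
@show norm(Mplus[:, k] - qr(M, ColumnNorm()) \ e_k)   # ≈ 0 up to roundoff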

Possibly there’s a flaw in that (admittedly loose) reasoning or it’s simply harder to come up with a nice example illustrating the effect of instability on forward errors than I thought. I’m not sure which now…

Nevertheless, solving Ax = b by inv(A) * b usually has good forward accuracy (comparable to using LU factors) and is backward stable under certain conditions, as argued by Druinsky & Toledo (2012).

Agreed. But I think it’s worth noting that even with good forward accuracy, you do typically have larger relative residuals with the inverse when A is ill conditioned and \|x\| \ll \|A^{-1}\| \|b\|. Whether forward accuracy is sufficient even with larger residuals depends on what you are doing with the solution. I still like to warn people away from explicit inversion unless they really want to worry about the details.

Agreed. Explicit matrix inversion (or pseudo-inversion) is only a good idea under relatively rare circumstances, even just from a performance perspective.