Solving Projection onto Linear Equality for Multiple Input Vectors

I want to solve the projection onto a linear equality constraint problem:

\arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} \; \text{ subject to } \boldsymbol{A} \boldsymbol{x} = \boldsymbol{b}

where \boldsymbol{A} \in \mathbb{R}^{m \times n} with m \ll n.

The closed form solution is:

\begin{align*} \boldsymbol{x} & = \boldsymbol{y} - \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \left( \boldsymbol{A} \boldsymbol{y} - \boldsymbol{b} \right) \\ & = \boldsymbol{y} - \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \boldsymbol{A} \boldsymbol{y} + \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \boldsymbol{b} \end{align*}

I wonder what the optimal pre-processing is in order to solve the above repeatedly for many instances of \boldsymbol{y} (the matrix \boldsymbol{A} and the vector \boldsymbol{b} stay the same).
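For reference, here is the naive per-\boldsymbol{y} computation I would like to speed up (a minimal sketch assuming a dense \boldsymbol{A}; the function name is just illustrative):

using LinearAlgebra

# Re-does all of the expensive work on A for every single y.
project_naive(A, b, y) = y - A' * ((A * A') \ (A * y - b))

Every call factors \boldsymbol{A} \boldsymbol{A}^{\top} from scratch, which is exactly the work I would like to move into a one-time pre-processing step.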

Specifically, how would you handle the cases where:

  • \boldsymbol{A} is a dense matrix.
  • \boldsymbol{A} is a sparse matrix.

For the case where \boldsymbol{A} is dense, a simple solution is to use the SVD of the matrix:

\begin{align*} \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \boldsymbol{A} & = \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} {\left( \boldsymbol{U} \boldsymbol{S} \boldsymbol{V}^{\top} \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} \right)}^{-1} \boldsymbol{U} \boldsymbol{S} \boldsymbol{V}^{\top} \\ & = \boldsymbol{V} \boldsymbol{V}^{\top} \end{align*}
\begin{align*} \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} & = \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} {\left( \boldsymbol{U} \boldsymbol{S} \boldsymbol{V}^{\top} \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} \right)}^{-1} \\ & = \boldsymbol{V} \boldsymbol{S}^{\dagger} \boldsymbol{U}^{\top} \end{align*}
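In code, this amounts to computing the thin SVD once and reusing its factors for every \boldsymbol{y}. A minimal sketch (assuming \boldsymbol{A} has full row rank so that \boldsymbol{S} is invertible; names are illustrative):

using LinearAlgebra

F = svd(A)                                 # thin SVD: A = U * Diagonal(S) * Vᔀ, with V of size n × m
w = F.V * (F.S .\ (F.U' * b))              # Aᔀ (A Aᔀ)⁻Âč b = V S⁻Âč Uᔀ b, computed once
project_svd(y) = y - F.V * (F.V' * y) + w  # per y: two thin products with V, O(m n) work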

If A is dense, then computing the inverse explicitly might not be the worst thing to do (especially if you’re solving many systems).

QR factorization, explicit inverse, Cholesky factorization, LU factorization, SVD: all of them will have O(n^3) complexity. Then each matrix-vector product will cost you O(n^2), because the factors are all dense.
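For instance, one could precompute the dense n × m matrix A^T (A A^T)^{-1} once and then pay only two matrix-vector products per \boldsymbol{y}. A sketch under the assumption that A is well-conditioned (see the caveats further down the thread):

using LinearAlgebra

P = A' / (A * A')                        # dense n × m matrix Aᔀ (A Aᔀ)⁻Âč, formed once
c = P * b                                # constant part of the solution
project_dense(y) = y - P * (A * y) + c   # per y: two dense matrix-vector products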

When A is sparse, my best guess would be to compute a Cholesky or LDL factorization of AA^T, and re-use that across multiple solves. Definitely avoid computing the inverse as it will likely be dense.
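A minimal sketch of that reuse pattern, assuming A is a SparseMatrixCSC (cholesky and ldlt here dispatch to the CHOLMOD-backed sparse factorizations):

using SparseArrays, LinearAlgebra

F = cholesky(Hermitian(A * A'))                  # or ldlt(Hermitian(A * A')); sparse factorization, computed once
project_sparse(y) = y - A' * (F \ (A * y - b))   # per y: sparse mat-vecs plus two triangular solves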

Finally, if you’re solving a lot of systems and have access to a GPU, check out CUDSS; it might be relevant for your case.

[disclaimer: linear algebra is not my strong suit]

When A is sparse, what about forming an augmented system and computing an indefinite factorization of it, instead of factorizing A A^T?

Care to explain which formulation you are referring to?

Usually an LQ factorization, or equivalently a QR factorization of A^T, will be faster.

Realize that A^T (AA^T)^{-1} is just the pseudo-inverse computing the minimum-norm solution, which is what A \ or factorization \ computes for a wide matrix. So, given F = qr(A, ColumnNorm()) or F = lq(A) or F = svd(A) or F = qr(A')', you can simply do:

x = y - F \ (A*y - b)

and it will do the right thing without squaring the condition number. In contrast, a Cholesky factorization of A A^T, used to apply (A A^T)^{-1}, squares the condition number (which doubles the number of digits you lose to roundoff errors). Precomputing the pseudo-inverse pinv(A) (= V S^{\dagger} U^T) is also numerically unstable. So these are things you should only consider if you know that A is well-conditioned.

Unfortunately, for the sparse case, we currently have no left-divide function for qr(A')' to get a minimum-norm solution without squaring the condition number, but the issue for this (SparseArrays.jl#656) includes a sample myldiv function that should work. So you can just do F = qr(A')' and then call x = y - myldiv(F, A*y - b).

For both the sparse and dense cases, using F = cholesky(Hermitian(A * A')) and then calling x = y - A' * (F \ (A*y - b)) is probably faster, but is numerically unstable so it shouldn’t be used unless you know that A is not too badly conditioned.

Of course, in the sparse case, whether a sparse-direct solver works at all depends on the size and the sparsity pattern. For very large sparse matrices that have an unfavorable sparsity pattern, you may run out of memory with a direct solver like the ones above and have to switch to iterative methods, which are another kettle of fish entirely.

PS. If you know all of your y vectors at once, consider putting them as columns of a matrix Y and using X = Y - F \ (A*Y .- b) to get a matrix X of solutions. Block operations like this are generally faster than operating on vectors one by one.


I meant writing the optimality conditions as a KKT system:

\begin{pmatrix} I & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} x \\ \lambda \end{pmatrix} = \begin{pmatrix} y \\ b \end{pmatrix},

where \lambda is the vector of Lagrange multipliers.
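Spelled out in Julia, a purely illustrative dense sketch of this would be something like:

using LinearAlgebra

n, m = length(y), length(b)
K  = [Matrix(1.0I, n, n) A'; A zeros(m, m)]   # (n+m) × (n+m) symmetric indefinite KKT matrix
xλ = K \ [y; b]                               # solve the augmented system
x, λ = xλ[1:n], xλ[n+1:end]

For repeated \boldsymbol{y}, one would factor K once (e.g. bunchkaufman for a dense symmetric K, or ldlt for a sparse one, as in the benchmark below) and reuse that factorization.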

I thought so, yet I don’t see how it helps for the multiple-\boldsymbol{y} case.
It seems one needs to compute a decomposition of a different (though symmetric) matrix.

Maybe one can now use an LU decomposition, which is faster?

Doesn’t seem like it; the fact that the KKT matrix is a lot bigger hurts you, even for the sparse case (and for the dense case it is awful). Here is a small 100 \times 10000 example:

using SparseArrays, LinearAlgebra, BenchmarkTools
A = sprand(100,10000, 1e-3, randn) # about 10 nonzeros per row
KKT = Hermitian([I A'; A 0I])
C = @btime cholesky(Hermitian($A*$A')); QR = @btime qr($A'); LDLᔀ = @btime ldlt($KKT);

which gives

  20.666 Ός (62 allocations: 64.62 KiB)
  171.000 Ός (177 allocations: 1.39 MiB)
  431.709 Ός (46 allocations: 2.53 MiB)

(I used ldlt to take advantage of the fact that the KKT matrix is symmetric-indefinite; using plain lu is slower, about 523 Ός.) Cholesky on AA^T is the fastest, as I noted above, but it is numerically unstable so should only be used if you know A is well-conditioned.

There’s also the time to use these factorizations for a new y:

julia> y = randn(size(A,2)); b = randn(size(A,1));

julia> x_c = @btime $y - $A' * ($C \ ($A*$y - $b));
  19.750 Ός (25 allocations: 200.23 KiB)

julia> x_qr = @btime $y - myldiv($QR', $A*$y - $b);
  22.875 Ός (30 allocations: 580.20 KiB)

julia> x_kkt = @btime ($LDLᔀ \ [$y; $b])[1:size(A,2)];
  20.708 Ός (26 allocations: 800.59 KiB)

julia> x_c ≈ x_qr ≈ x_kkt
true

which is about the same time ±10% for all three (somewhat surprisingly), and all three give approximately the same result x in this well-conditioned example (unsurprisingly).


Aside from performance concerns, it’s not obvious to me that this is numerically stable. Empirically, it makes the condition number even worse than squared:

julia> m, n = 10, 1000
(10, 1000)

julia> A = rand(m,n);

julia> cond(A)
5.9187820909142825

julia> cond(A*A') # squares the condition number
35.031981439727694

julia> cond([I A'; A 0I]) # worse than squared?
51.70345942626648

julia> A = qr(randn(m,m)).Q * Diagonal(exp10.(range(0,-6,length=m))) * Matrix(qr(randn(n,n)).Q)[1:m,:]; # condition number 10⁶

julia> cond(A)
999999.9999974446

julia> cond(A*A') # squared
1.0000072522846323e12

julia> cond([I A'; A 0I]) # slightly worse than squared, again!
1.6179965844734807e12

I haven’t had a chance to go through the algebra (or find a reference) for why this is the case. I would recommend against directly solving this KKT system in practice: it is both slower and less accurate than the competing methods above.

The 2005 paper “Numerical solution of saddle-point problems” by Benzi, Golub, & Liesen notes that these KKT matrices “can be very poorly conditioned”, but that the special structure of the problem can be exploited to mitigate ill-conditioning. (Of course, that’s effectively what the QR-based algorithm is doing, and indeed that is cited as one possible method in the paper.) Theorem 3.5 of that paper gives bounds on the eigenvalues of the KKT matrix. If the singular values of A are \sigma_{\max} \sim 1 and \sigma_{\min} \ll 1 as in the second case above, for example, this analysis shows that the condition number of the KKT matrix indeed scales like the square of the condition number of A.
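In fact, because the (1,1) block of this KKT matrix is the identity, its eigenvalues can be written down exactly (a short calculation, consistent with the bounds cited above). Using the SVD of A, the eigenvalues are 1, with multiplicity n - m, together with one pair

\mu_i^{\pm} = \frac{1 \pm \sqrt{1 + 4 \sigma_i^2}}{2}, \qquad i = 1, \dots, m,

for each singular value \sigma_i of A. With \sigma_{\max} = 1 and \sigma_{\min} = 10^{-6} as in the second example above, the extreme eigenvalues are \approx (1 + \sqrt{5})/2 \approx 1.618 and \approx -\sigma_{\min}^2 = -10^{-12}, so the predicted condition number is about 1.618 \times 10^{12}, closely matching the value observed above.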


Really interesting! Thanks for the numerical example and the reference.
Pretty much all the big optimization codes like IPOPT and KNITRO compute least-squares Lagrange multipliers using the KKT system. Cool to see that there are more efficient alternatives.

When you write:

cholesky(Hermitian($A*$A'))

is the matrix explicitly formed or is it some kind of symbolic expression?

Yes, AA^T is explicitly formed (albeit sparse, if A is sparse) before computing the (sparse) Cholesky factor.
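For example (B is just an illustrative name):

B = A * A'                    # the m × m product is explicitly materialized as a SparseMatrixCSC
F = cholesky(Hermitian(B))    # CHOLMOD then factors that explicit (sparse) matrix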

(And the dollar signs are just a BenchmarkTools.jl thing: they interpolate the values to avoid the penalty of benchmarking with untyped global variables.)
