Solving Projection onto Linear Equality for Multiple Input Vectors

I want to solve the projection onto a linear equality constraint problem:

\arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} \; \text{ subject to } \boldsymbol{A} \boldsymbol{x} = \boldsymbol{b}

where \boldsymbol{A} \in \mathbb{R}^{m \times n} with m \ll n.

The closed form solution is:

\begin{align*} \boldsymbol{x} & = \boldsymbol{y} - \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \left( \boldsymbol{A} \boldsymbol{y} - \boldsymbol{b} \right) \\ & = \boldsymbol{y} - \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \boldsymbol{A} \boldsymbol{y} + \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \boldsymbol{b} \end{align*}

I wonder what the optimal pre-processing is in order to solve the above repeatedly for many instances of \boldsymbol{y} (the matrix \boldsymbol{A} and the vector \boldsymbol{b} stay the same).
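For reference, here is the naive per-\boldsymbol{y} computation I would like to speed up (a minimal sketch assuming a dense \boldsymbol{A}; the function name is just illustrative):

using LinearAlgebra

# Re-does all of the expensive work on A for every single y.
project_naive(A, b, y) = y - A' * ((A * A') \ (A * y - b))

Every call factors \boldsymbol{A} \boldsymbol{A}^{\top} from scratch, which is exactly the work I would like to move into a one-time pre-processing step.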

Specifically, how would you handle the cases where:

  • \boldsymbol{A} is a dense matrix.
  • \boldsymbol{A} is a sparse matrix.

For the case where \boldsymbol{A} is dense, a simple solution is to use the SVD of the matrix:

\begin{align*} \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \boldsymbol{A} & = \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} {\left( \boldsymbol{U} \boldsymbol{S} \boldsymbol{V}^{\top} \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} \right)}^{-1} \boldsymbol{U} \boldsymbol{S} \boldsymbol{V}^{\top} \\ & = \boldsymbol{V} \boldsymbol{V}^{\top} \end{align*}
\begin{align*} \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} & = \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} {\left( \boldsymbol{U} \boldsymbol{S} \boldsymbol{V}^{\top} \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} \right)}^{-1} \\ & = \boldsymbol{V} \boldsymbol{S}^{\dagger} \boldsymbol{U}^{\top} \end{align*}
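In code, this amounts to computing the thin SVD once and reusing its factors for every \boldsymbol{y}. A minimal sketch (assuming \boldsymbol{A} has full row rank so that \boldsymbol{S} is invertible; names are illustrative):

using LinearAlgebra

F = svd(A)                                 # thin SVD: A = U * Diagonal(S) * Vᔀ, with V of size n × m
w = F.V * (F.S .\ (F.U' * b))              # Aᔀ (A Aᔀ)⁻Âč b = V S⁻Âč Uᔀ b, computed once
project_svd(y) = y - F.V * (F.V' * y) + w  # per y: two thin products with V, O(m n) work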

If A is dense, then computing the inverse explicitly might not be the worst thing to do (especially if you’re solving many systems).

QR factorization, explicit inverse, Cholesky factorization, LU factorization, SVD: all of them will have O(n^3) complexity. Then each matrix-vector product will cost you O(n^2), because the factors are all dense.
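For instance, one could precompute the dense n × m matrix A^T (A A^T)^{-1} once and then pay only two matrix-vector products per \boldsymbol{y}. A sketch under the assumption that A is well-conditioned (see the caveats further down the thread):

using LinearAlgebra

P = A' / (A * A')                        # dense n × m matrix Aᔀ (A Aᔀ)⁻Âč, formed once
c = P * b                                # constant part of the solution
project_dense(y) = y - P * (A * y) + c   # per y: two dense matrix-vector products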

When A is sparse, my best guess would be to compute a Cholesky or LDL factorization of AA^T, and re-use that across multiple solves. Definitely avoid computing the inverse as it will likely be dense.
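A minimal sketch of that reuse pattern, assuming A is a SparseMatrixCSC (cholesky and ldlt here dispatch to the CHOLMOD-backed sparse factorizations):

using SparseArrays, LinearAlgebra

F = cholesky(Hermitian(A * A'))                  # or ldlt(Hermitian(A * A')); sparse factorization, computed once
project_sparse(y) = y - A' * (F \ (A * y - b))   # per y: sparse mat-vecs plus two triangular solves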

Finally, if you’re solving a lot of systems and have access to a GPU, check out CUDSS; it might be relevant for your case.

[disclaimer: linear algebra is not my strong suit]

When A is sparse, what about forming an augmented system and computing an indefinite factorization of it, instead of factorizing A A^T?

Care to explain which formulation you are referring to?

Usually an LQ factorization, or equivalently a QR factorization of A^T, will be faster.

Realize that A^T (AA^T)^{-1} is just the pseudo-inverse computing the minimum-norm solution, which is what A \ or factorization \ computes for a wide matrix. So, given F = qr(A, ColumnNorm()) or F = lq(A) or F = svd(A) or F = qr(A')', you can simply do:

x = y - F \ (A*y - b)

and it will do the right thing without squaring the condition number. In contrast, a Cholesky factorization of A A^T, used to apply (A A^T)^{-1}, squares the condition number (which doubles the number of digits you lose to roundoff errors). Precomputing the pseudo-inverse pinv(A) (= V S^{\dagger} U^T) is also numerically unstable. So these are things you should only consider if you know that A is well-conditioned.

Unfortunately, for the sparse case, we currently have no left-divide function for qr(A')' to get a minimum-norm solution without squaring the condition number, but the issue for this (SparseArrays.jl#656) includes a sample myldiv function that should work. So you can just do F = qr(A')' and then call x = y - myldiv(F, A*y - b).

For both the sparse and dense cases, using F = cholesky(Hermitian(A * A')) and then calling x = y - A' * (F \ (A*y - b)) is probably faster, but is numerically unstable so it shouldn’t be used unless you know that A is not too badly conditioned.

Of course, in the sparse case, whether a sparse-direct solver works at all depends on the size and the sparsity pattern. For very large sparse matrices that have an unfavorable sparsity pattern, you may run out of memory with a direct solver like the ones above and have to switch to iterative methods, which are another kettle of fish entirely.

PS. If you know all of your y vectors at once, consider putting them as columns of a matrix Y and using X = Y - F \ (A*Y .- b) to get a matrix X of solutions. Block operations like this are generally faster than operating on vectors one by one.


I meant writing the optimality conditions as a KKT system:

\begin{pmatrix} I & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} x \\ \lambda \end{pmatrix} = \begin{pmatrix} y \\ b \end{pmatrix},

where \lambda is the vector of Lagrange multipliers.
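Spelled out in Julia, a purely illustrative dense sketch of this would be something like:

using LinearAlgebra

n, m = length(y), length(b)
K  = [Matrix(1.0I, n, n) A'; A zeros(m, m)]   # (n+m) × (n+m) symmetric indefinite KKT matrix
xλ = K \ [y; b]                               # solve the augmented system
x, λ = xλ[1:n], xλ[n+1:end]

For repeated \boldsymbol{y}, one would factor K once (e.g. bunchkaufman for a dense symmetric K, or ldlt for a sparse one, as in the benchmark below) and reuse that factorization.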

I thought so, yet I don’t see how it helps for the multiple-\boldsymbol{y} case.
It seems one needs to compute a decomposition of a different (though symmetric) matrix.

Maybe one can now use an LU decomposition, which is faster?

Doesn’t seem like it; the fact that the KKT matrix is a lot bigger hurts you, even for the sparse case (and for the dense case it is awful). Here is a small 100 \times 10000 example:

using SparseArrays, LinearAlgebra, BenchmarkTools
A = sprand(100,10000, 1e-3, randn) # about 10 nonzeros per row
KKT = Hermitian([I A'; A 0I])
C = @btime cholesky(Hermitian($A*$A')); QR = @btime qr($A'); LDLᔀ = @btime ldlt($KKT);

which gives

  20.666 Ός (62 allocations: 64.62 KiB)
  171.000 Ός (177 allocations: 1.39 MiB)
  431.709 Ός (46 allocations: 2.53 MiB)

(I used ldlt to take advantage of the fact that the KKT matrix is symmetric-indefinite; using plain lu is slower, about 523 Ός.) Cholesky on AA^T is the fastest, as I noted above, but it is numerically unstable so should only be used if you know A is well-conditioned.

There’s also the time to use these factorizations for a new y:

julia> y = randn(size(A,2)); b = randn(size(A,1));

julia> x_c = @btime $y - $A' * ($C \ ($A*$y - $b));
  19.750 Ός (25 allocations: 200.23 KiB)

julia> x_qr = @btime $y - myldiv($QR', $A*$y - $b);
  22.875 Ός (30 allocations: 580.20 KiB)

julia> x_kkt = @btime ($LDLᔀ \ [$y; $b])[1:size(A,2)];
  20.708 Ός (26 allocations: 800.59 KiB)

julia> x_c ≈ x_qr ≈ x_kkt
true

which is about the same time ±10% for all three (somewhat surprisingly), and all three give approximately the same result x in this well-conditioned example (unsurprisingly).


Aside from performance concerns, it’s not obvious to me that this is numerically stable. Empirically, it makes the condition number even worse than squared:

julia> m, n = 10, 1000
(10, 1000)

julia> A = rand(m,n);

julia> cond(A)
5.9187820909142825

julia> cond(A*A') # squares the condition number
35.031981439727694

julia> cond([I A'; A 0I]) # worse than squared?
51.70345942626648

julia> A = qr(randn(m,m)).Q * Diagonal(exp10.(range(0,-6,length=m))) * Matrix(qr(randn(n,n)).Q)[1:m,:]; # condition number 10⁶

julia> cond(A)
999999.9999974446

julia> cond(A*A') # squared
1.0000072522846323e12

julia> cond([I A'; A 0I]) # slightly worse than squared, again!
1.6179965844734807e12

I haven’t had a chance to go through the algebra (or find a reference) for why this is the case. I would recommend against directly solving this KKT system in practice: it is both slower and less accurate than the competing methods above.

The 2005 paper “Numerical solution of saddle-point problems” by Benzi, Golub, & Liesen notes that these KKT matrices “can be very poorly conditioned”, but that the special structure of the problem can be exploited to mitigate ill-conditioning. (Of course, that’s effectively what the QR-based algorithm is doing, and indeed that is cited as one possible method in the paper.) Theorem 3.5 of that paper gives bounds on the eigenvalues of the KKT matrix. If the singular values of A are \sigma_{\max} \sim 1 and \sigma_{\min} \ll 1 as in the second case above, for example, this analysis shows that the condition number of the KKT matrix indeed scales like the square of the condition number of A.
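In fact, because the (1,1) block of this KKT matrix is the identity, its eigenvalues can be written down exactly (a short calculation, consistent with the bounds cited above). Using the SVD of A, the eigenvalues are 1, with multiplicity n - m, together with one pair

\mu_i^{\pm} = \frac{1 \pm \sqrt{1 + 4 \sigma_i^2}}{2}, \qquad i = 1, \dots, m,

for each singular value \sigma_i of A. With \sigma_{\max} = 1 and \sigma_{\min} = 10^{-6} as in the second example above, the extreme eigenvalues are \approx (1 + \sqrt{5})/2 \approx 1.618 and \approx -\sigma_{\min}^2 = -10^{-12}, so the predicted condition number is about 1.618 \times 10^{12}, closely matching the value observed above.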


Really interesting! Thanks for the numerical example and the reference.
Pretty much all the big optimization codes like IPOPT and KNITRO compute least-squares Lagrange multipliers using the KKT system. Cool to see that there are more efficient alternatives.

When you write:

cholesky(Hermitian($A*$A'))

is the matrix explicitly formed or is it some kind of symbolic expression?

Yes, AA^T is explicitly formed (albeit sparse, if A is sparse) before computing the (sparse) Cholesky factor.
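For example (B is just an illustrative name):

B = A * A'                    # the m × m product is explicitly materialized as a SparseMatrixCSC
F = cholesky(Hermitian(B))    # CHOLMOD then factors that explicit (sparse) matrix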

(And the dollar signs are just a BenchmarkTools.jl thing: they interpolate the values to avoid the penalty of benchmarking with untyped global variables.)
