I am after solving the projection onto a linear equality constraint problem:
\arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} \; \text{ subject to } \boldsymbol{A} \boldsymbol{x} = \boldsymbol{b}
where \boldsymbol{A} \in \mathbb{R}^{m \times n} with m \ll n.
The closed form solution is:
\begin{align*}
\boldsymbol{x} & = \boldsymbol{y} - \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \left( \boldsymbol{A} \boldsymbol{y} - \boldsymbol{b} \right) \\
& = \boldsymbol{y} - \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \boldsymbol{A} \boldsymbol{y} + \boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \boldsymbol{b}
\end{align*}
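For reference, a direct implementation of this closed form on random data (a minimal sketch; the sizes are made up for illustration) is:

using LinearAlgebra

m, n = 5, 12                               # illustrative sizes, m << n
A = randn(m, n); b = randn(m); y = randn(n)

x = y - A' * ((A * A') \ (A * y - b))      # closed-form projection
@assert A * x ≈ b                          # x lands on the constraint set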
I wonder what the optimal pre-processing is in order to solve the above repeatedly for many instances of \boldsymbol{y} (the matrix \boldsymbol{A} and the vector \boldsymbol{b} stay the same).
Specifically, how would you handle the cases:
- \boldsymbol{A} is a dense matrix.
- \boldsymbol{A} is a sparse matrix.
For the case where \boldsymbol{A} is dense, a simple solution is to use the SVD of the matrix:
\begin{align*}
\boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} \boldsymbol{A} & = \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} {\left( \boldsymbol{U} \boldsymbol{S} \boldsymbol{V}^{\top} \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} \right)}^{-1} \boldsymbol{U} \boldsymbol{S} \boldsymbol{V}^{\top} \\
& = \boldsymbol{V} \boldsymbol{V}^{\top}
\end{align*}
\begin{align*}
\boldsymbol{A}^{\top} {\left( \boldsymbol{A} \boldsymbol{A}^{\top} \right)}^{-1} & = \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} {\left( \boldsymbol{U} \boldsymbol{S} \boldsymbol{V}^{\top} \boldsymbol{V} \boldsymbol{S} \boldsymbol{U}^{\top} \right)}^{-1} \\
& = \boldsymbol{V} \boldsymbol{S}^{\dagger} \boldsymbol{U}^{\top}
\end{align*}
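So, assuming a thin SVD and full row rank, the pre-processing would be to compute the SVD once, store \boldsymbol{V}, and fold \boldsymbol{b} into a constant vector; each new \boldsymbol{y} then costs O(mn). A minimal sketch (sizes made up):

using LinearAlgebra

m, n = 100, 10_000                      # illustrative sizes
A = randn(m, n); b = randn(m)

F = svd(A)                              # thin SVD: A = U * Diagonal(S) * V'
V = F.V                                 # n × m
c = V * (F.S .\ (F.U' * b))             # A' (A A')^{-1} b, computed once

project(y) = y - V * (V' * y) + c       # O(m n) work per new y

y = randn(n)
x = project(y)
@assert A * x ≈ b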
If A is dense, then computing the inverse explicitly might not be the worst thing to do (especially if you’re solving many systems).
QR factorization, inverse, Cholesky factorization, LU factorization, SVD… all of them have O(n^3) complexity. Then each matrix-vector product will cost you O(n^2) because they’re all dense.
When A is sparse, my best guess would be to compute a Cholesky or LDL factorization of AA^T, and re-use that across multiple solves. Definitely avoid computing the inverse as it will likely be dense.
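Something like the following sketch, say (made-up sparse data; the sparse Cholesky is done by CHOLMOD under the hood):

using SparseArrays, LinearAlgebra

m, n = 100, 10_000                         # illustrative sizes
A = sprand(m, n, 1e-2); b = randn(m)       # roughly 100 nonzeros per row

F = cholesky(Hermitian(A * A'))            # factor the m × m Gram matrix once
project(y) = y - A' * (F \ (A * y - b))    # reuse the factorization for every y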
Finally, if you’re solving a lot of systems and have access to a GPU, check out CUDSS, it might be relevant for your case.
[disclaimer: linear algebra is not my strong suit]
When A is sparse, what about forming and computing an indefinite factorization of an augmented system, instead of factorizing A A^T?
Care to explain which formulation you are referring to?
Usually an LQ factorization, or equivalently a QR factorization of A^T, will be faster.
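To spell out why: with a thin QR factorization A^\top = Q R (so Q^\top Q = I and A A^\top = R^\top R),
\begin{align*}
A^{\top} {\left( A A^{\top} \right)}^{-1} = Q R {\left( R^{\top} R \right)}^{-1} = Q R^{-\top},
\end{align*}
so each projection needs only one triangular solve with R^\top and one multiplication by Q, and A A^\top is never formed explicitly.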
Realize that A^T (AA^T)^{-1} is just the pseudo-inverse computing the minimum-norm solution, which is what A \ or factorization \ computes for a wide matrix. So, given F = qr(A, ColumnNorm()) or F = lq(A) or F = svd(A) or F = qr(A')', you can simply do:
x = y - F \ (A*y - b)
and it will do the right thing without squaring the condition number. In contrast, a Cholesky factorization of A A^T to explicitly compute (A A^T)^{-1} squares the condition number (which doubles the number of digits you lose to roundoff errors). Precomputing the pseudo-inverse pinv(A) (= V S^+ U^T) is also numerically unstable. So these are things you should only consider if you know that A is well-conditioned.
Unfortunately, for the sparse case, we currently have no left-divide function for qr(A')' to get a minimum-norm solution without squaring the condition number, but the issue for this (SparseArrays.jl#656) includes a sample myldiv function that should work. So you can just do F = qr(A')' and then call x = y - myldiv(F, A*y - b).
For both the sparse and dense cases, using F = cholesky(Hermitian(A * A')) and then calling x = y - A' * (F \ (A*y - b)) is probably faster, but is numerically unstable so it shouldn’t be used unless you know that A is not too badly conditioned.
Of course, in the sparse case, whether a sparse-direct solver works at all depends on the size and the sparsity pattern. For very large sparse matrices that have an unfavorable sparsity pattern, you may run out of memory with a direct solver like the ones above and have to switch to iterative methods, which are another kettle of fish entirely.
PS. If you know all of your y vectors at once, consider putting them as columns of a matrix Y and using X = Y - F \ (A*Y .- b) to get a matrix X of solutions. Block operations like this are generally faster than operating on vectors one by one.
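Putting the dense-case recipe together, a self-contained toy version (sizes made up) might look like:

using LinearAlgebra

m, n = 50, 2_000                        # illustrative sizes
A = randn(m, n); b = randn(m)

F = qr(A, ColumnNorm())                 # factor once; lq(A) or svd(A) also work
Y = randn(n, 8)                         # several y vectors as columns
X = Y - F \ (A * Y .- b)                # one block solve for all of them

@assert all(col -> col ≈ b, eachcol(A * X))   # every column satisfies A x = b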
I meant writing the optimality conditions as a KKT system:
\begin{pmatrix}
I & A^T \\
A & 0
\end{pmatrix}
\begin{pmatrix}
x \\
\lambda
\end{pmatrix}
=
\begin{pmatrix}
y \\
b
\end{pmatrix},
where \lambda is the vector of Lagrange multipliers.
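Mechanically, one would factor this (sparse, symmetric indefinite) KKT matrix once, e.g. with an LDL^T factorization, and then do one backsolve per y. A rough sketch with made-up data:

using SparseArrays, LinearAlgebra

m, n = 100, 10_000                              # illustrative sizes
A = sprand(m, n, 1e-2, randn); b = randn(m)

KKT = [sparse(1.0I, n, n) A'; A spzeros(m, m)]  # (n+m) × (n+m) KKT matrix
F = ldlt(Symmetric(KKT))                        # factor the indefinite system once
project(y) = (F \ [y; b])[1:n]                  # each new y is a single solve; drop λ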
I thought so. Yet I don’t see how it helps for the multiple-solve case.
It seems one needs to compute a decomposition of a different (though symmetric) matrix.
Maybe now one can use an LU decomposition, which is faster?
Doesn’t seem like it; the fact that the KKT matrix is a lot bigger hurts you, even for the sparse case (and for the dense case it is awful). Here is a small 100 \times 10000 example:
using SparseArrays, LinearAlgebra, BenchmarkTools
A = sprand(100,10000, 1e-3, randn) # about 10 nonzeros per row
KKT = Hermitian([I A'; A 0I])
@btime cholesky(Hermitian($A*$A')); @btime qr($A'); @btime ldlt($KKT);
which gives
20.666 μs (62 allocations: 64.62 KiB)
171.000 μs (177 allocations: 1.39 MiB)
431.709 μs (46 allocations: 2.53 MiB)
(I used ldlt to take advantage of the fact that the KKT matrix is symmetric-indefinite; using plain lu is slower, about 523µs.) Cholesky on AA^T is the fastest, as I noted above, but it is numerically unstable so should only be used if you know A is well-conditioned.