Sparse Cholesky of Gram Matrix (SuiteSparse?)

Hey all! I have a quick question, which I saw was referenced here some time ago.

I was wondering if there is a standard way of factorizing A^TA (i.e., the Gram matrix of A) without explicitly computing the Gram matrix. This is usually done via the COLAMD ordering in CHOLMOD, but SuiteSparse doesn’t seem to expose this to a Julia call (though there are some references to it in spqr.jl found here). Currently, and with only OK performance, I’m using Cholesky on A^TA, but a good chunk of the time is actually spent computing the product itself :slight_smile:

Is there a standard package that exposes COLAMD for sparse matrices?

Thanks!

3 Likes

Could you use spqr instead of Cholesky? A^TA is just R^TR, and doesn’t require the Gram matrix.
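For example, an untested sketch of that idea (it assumes the sparse qr object stores its fill-reducing permutations so that F.Q * F.R == A[F.prow, F.pcol]; then A^TA == P R^T R P^T with P the column permutation F.pcol, since the row permutation cancels in the Gram matrix):

using SparseArrays, LinearAlgebra, SuiteSparse

# Illustrative only: solve (A'A) x = y using just the R factor of a sparse QR of A.
function gram_solve(F::SuiteSparse.SPQR.QRSparse, y::AbstractVector)
    Rt = sparse(F.R')                                     # materialize Rᵀ (lower triangular)
    v = y[F.pcol]                                         # apply the column permutation
    u = UpperTriangular(F.R) \ (LowerTriangular(Rt) \ v)  # solve RᵀR u = v
    x = similar(u)
    x[F.pcol] = u                                         # undo the column permutation
    return x
end

# Hypothetical usage:
# F = qr(A)             # factor once
# x = gram_solve(F, y)  # then A'A * x ≈ y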

I guess that’s absolutely true! The only part that isn’t great is that it requires two back-solves, and iterations are the main cost; but that’s a pretty good idea for a first pass, thanks!

I guess I wasn’t entirely sure what you needed - CHOLMOD usually computes row and column permutations, but your question was just asking for the factorization of A^TA.

If it works, great. Just wasn’t sure how the two connected.

I guess the question is (which I apologize, as it may not have been clear from above!): I need a sparse factorization of A^TA which would be used in solving a system of the form (A^TA)^{-1}b for many possible vectors b. Sparse Cholesky (via CHOLMOD) on A^TA in the normal mode works since it yields a triangular system which can be used to solve the above problem, but computing A^TA for the desired matrices (which are large, on the order of 10^7 \times 10^7 with .1% fill) is quite expensive.

CHOLMOD (from SuiteSparse) can alternatively compute the sparse Cholesky factorization of A^TA without first forming the Gram matrix (and it can compute a good, small-fill permutation via COLAMD rather than AMD). This is likely to be more efficient, since (a) the Gram matrix is only formed implicitly and (b) the factor tends to have smaller fill-in, because the structure of A is used directly in the factorization. But it’s unclear how to perform a cholesky call from Julia that yields a factorization of A^TA directly, without computing the product.

In most circumstances, you should try to avoid computing the Gram matrix A^T A at all, as it squares the condition number of A. See also my explanation in another thread on least-squares problems.

I need a sparse factorization of A^T A which would be used in solving a system of the form (A^T A)^{-1}b for many possible vectors b.

Are you solving least-square problems? That is, are you actually solving (A^T A)^{-1} A^T b? In that case what you want is to use the A=QR factorization and solve R^{-1} (Q^* b), which is equivalent but avoids squaring the condition number (or multiplying large matrices together). In fact, the QR factorization object in Julia will do this for you:

QR = qr(A)    # QR factorization (sparse if A is sparse)
x = QR \ b    # least-square solution

Is there some other reason why you want to factorize the Gram matrix?

4 Likes

Heya Prof. Johnson! :slight_smile:

In most circumstances, you should try to avoid computing the Gram matrix A^TA at all, as it squares the condition number of A.

Absolutely, this is what I’m trying to avoid, but at the moment there doesn’t seem to be another way of approaching this. I am evaluating the proximal operator of a quadratic (and of other, similar functions) for ADMM, so it requires solving a ‘tilted’ least-squares problem (i.e., a sum of squares plus a linear function), which would require a factorization of the Gram matrix. Essentially, what’s an easy way of computing the solution to

\text{minimize} ~ \|Ax - b\|_2^2 + c^Tx,

for several b and c? (Where x here is the optimization variable, of course.)

As mentioned by @jlchan, one possibility is to backsolve with R and R^T (where QR = A), but this is also not great for computational performance, as back-solving is the time-consuming step in each iteration.

(As a side note: I’m also not worried about losing much precision due to a large condition number, especially for sparse-direct solving, since I’ll be using ADMM to solve the problem to a rather moderate precision.)

No, still no Gram matrix required as I understand it. This is actually exercise 12.15 in Boyd’s VMLS book.

In particular, you can use the identity \Vert Ax - b \Vert^2 + c^Tx = \Vert Ax - b + f\Vert ^2 - \Vert f \Vert^2 + 2f^T b, where f is any solution of the underdetermined system 2A^T f = c. (To check it, expand \Vert Ax - b + f\Vert^2 = \Vert Ax - b\Vert^2 + 2f^T(Ax - b) + \Vert f\Vert^2 and note that 2f^TAx = (2A^Tf)^Tx = c^Tx.) Since the latter terms are constants, this is equivalent to minimizing the ordinary least-squares problem \Vert Ax - b + f\Vert ^2. That is, you can do it in Julia with:

F = qr(A)

# You should really be able to do 
#     f = F' \ 0.5c
# to get a 2A'f = c solution, but this method isn't implemented yet
# for sparse matrices A.

F2 = qr(oftype(A, A'))
f = F2 \ 0.5c

x = F \ (b - f) 

and of course you can re-use the factorizations for additional b and c. (I filed #35421 for the missing qr(A)' \ c method.)
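For example, reusing both factorizations across iterations might look like this (illustrative; `problems` is just a stand-in for however the (b, c) pairs are generated):

for (b, c) in problems       # hypothetical iterator over right-hand-side pairs
    f = F2 \ (0.5 * c)       # a solution of 2A'f = c, reusing F2 from above
    x = F \ (b - f)          # least-squares solution of the tilted problem, reusing F
    # ... use x ...
end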

4 Likes

Yes, sure, it is possible to do it with two factorizations: one to solve the underdetermined system (e.g., via a Moore–Penrose pseudoinverse or another least-norm-type solution) and one to solve the resulting least-squares problem. But just computing the two factorizations would outweigh any performance gains over simply factorizing the Gram matrix and reusing that factorization (even if its numerical accuracy is lower, which doesn’t affect ADMM much due to its generally low-to-modest precision). For context, factorizing the matrix takes on the order of the same time as the entire algorithm takes to converge once the factorization is known.

Which brings me back to the original question: as per, e.g., this paper, section 3.2, CHOLMOD should support a method for computing the Cholesky factorization of the Gram matrix of A without needing to directly compute the Gram matrix. Is there any way of accessing this method from Julia?

EDIT: Ah, I see (the post became a little clearer after the edit). Even in this case, though, the dimensions of the problem are very close to square, and this method would still require two solves per iteration, which adds a large constant factor to the runtime (this is similar to reusing the R matrix of the QR factorization of A).

As I noted in my comment, you don’t need two factorizations, you only need one. The QR factorization of A (which you use to solve the least-squares problem) automatically gives you the LQ factorization of A^T (which you can use to solve the underdetermined problem).

Unfortunately, qr(A)' \ c is not implemented in Julia, and I was too lazy to dig through the QR factorization object F = qr(A) to figure out how to use its transpose for the underdetermined problem. (There are some permutation vectors inside F that you need to decipher.) In principle this shouldn’t be too much work, however, which is why I filed missing qr(A)' \ b for sparse · Issue #115 · JuliaSparse/SparseArrays.jl · GitHub.
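Concretely, ignoring the permutations for a moment and writing the thin factorization A = QR (so Q has orthonormal columns), one particular solution of 2A^Tf = c is

f = Q\,R^{-T}\left(\tfrac{1}{2}c\right), \qquad \text{since} \qquad A^Tf = R^TQ^TQ\,R^{-T}\tfrac{1}{2}c = \tfrac{1}{2}c,

i.e., one triangular solve with R^T plus one application of Q, both reusing the same F = qr(A); the fiddly part is only the permutation bookkeeping inside the sparse factorization object.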

1 Like

As I noted in my comment, you don’t need two factorizations, you only need one. The QR factorization of A (which you use to solve the least-squares problem) automatically gives you the LQ factorization of A^T (which you can use to solve the underdetermined problem).

Oops, forgot to strike out that part of the comment! Yes, you only need one factorization, but it still requires two backsolves (as I noted after the edit); still, having that utility would be quite nice in general :slight_smile:. This solution would still roughly double the current runtime, though, and I’m not sure the gains in precision (especially for a low-to-moderate-precision method such as ADMM) would be worth it.

Note that the backsolves are only O(n^2) compared to the factoring which is O(n^3), so 2 backsolves of a QR matrix shouldn’t be a problem.

1 Like

(Using Cholesky factors also requires one backsolve and one forward solve, which is equivalent to two backsolves, no?) It might well be that Cholesky factorization itself is faster than QR factorization for your sparse matrix, of course.

Yup, this is why I cache the factorization :slight_smile: On the other hand, a backsolve is still very expensive in an inner loop of an algorithm, which is what I’m currently worried about! (I mentioned this above, but these matrices are ~ 10^7\times 10^7, with roughly 10^8 nnz).

Absolutely; my main worry is that solving the problem using the QR factorization requires a multiplication by Q and a backsolve (both done twice, once for each problem, in the inner loop). But I don’t know the performance of sparse QR vs. sparse Cholesky for these kinds of large, sparse, structured matrices, so I guess it’s worth a comparison! I’ll report results once I have them :slight_smile:

Thanks for the help!

@guille, I think what’s not clear is what you will do with the functionality you’re asking for.

Namely, let’s say you had access to the function you want, could you describe the rest of the computation?

Sure! So, we want to solve

\text{minimize} ~ \|Ax - b\|_2^2 + c^Tx

for several vectors b, c and fixed A. There are several ways of approaching this; the one I’ve been using is rather simple: expand out the squared norm to get a convex quadratic and minimize it appropriately, i.e., we want the minimizer of

x^T(A^TA)x - (2A^Tb - c)^Tx,

which is minimized at

x^\star = (A^TA)^{-1}\left(A^Tb - \tfrac{1}{2}c\right),

assuming that A is full column rank (which I enforce). In that case, computing and caching the Cholesky factorization of A^TA lets us solve the problem repeatedly at a relatively modest computational cost (in an inner loop), especially considering that A is quite sparse. Simple code to do this is

using LinearAlgebra, SparseArrays, SuiteSparse

A = ...
max_iter = 1000

# Cache the sparse Cholesky factorization of the Gram matrix (computed once).
A_fact = cholesky(A' * A)

for i in 1:max_iter
    b, c = ...
    x_sol = A_fact \ (A' * b - 0.5 * c)
    # Do something with x_sol
end

Of course, computing A' * A is somewhat expensive (even though it’s only done once, these are somewhat large matrices) and likely destroys some of the useful sparsity structure available in A. On the other hand, if we could compute the factorization of A' * A in a way that already exploits its sparsity (e.g., via COLAMD) and doesn’t require constructing the Gram matrix, that would be ideal (we would simply replace A_fact with whatever the new factorization is). Several papers (as in here, section 3.2) indicate that CHOLMOD has this capability, and I was wondering if it was exposed within the SuiteSparse bindings for Julia.

@stevengj instead suggested that I compute the sparse QR factorization of A and use that to solve the least-squares problem. There are two ways of going about that: one is given by @jlchan (use the fact that R^TR = A^TA) and solve the problem based on that idea. A second possibility (the one suggested by @stevengj) is to use the QR factorization of A directly in order to solve two problems: a least-norm problem to find a solution of 2A^Td + c = 0, and the (equivalent) least-squares problem \min \|Ax - (b + d)\|_2^2.

Because the matrices are sparse and the procedures are somewhat different, it is unclear which of the two (three?) approaches will be faster. (There are tradeoffs in fill-in between the possible factorizations, which could heavily change the performance of the back solves.) As far as I know, Cholesky has fairly small fill-in and good performance even when solving normal-type equations (compared to QR), so I’m biased towards using it, but, really, we won’t know unless we try it out, as all of this is heavily dependent on the sparsity pattern, among many other things.

2 Likes

I did a little digging, and you can access this functionality from Julia without too much difficulty. In Julia 1.4, do:

using SparseArrays, LinearAlgebra
using SuiteSparse.CHOLMOD

# factorize A*A'
function cholgram(A::Sparse)
    cm = CHOLMOD.defaults(CHOLMOD.common_struct[Threads.threadid()])
    CHOLMOD.set_print_level(cm, 0)

    # Compute the symbolic factorization
    F = CHOLMOD.analyze(A, cm)

    # Compute the numerical factorization
    cholesky!(F, A)

    return F
end
cholgram(A::SparseMatrixCSC) = cholgram(Sparse(A, 0)) # stype == 0

and then cholgram(A) \ b is equivalent to (but much faster than) (A*A') \ b. (Note that it’s AA^T, not A^TA.)
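For the A^TA flavor needed in this thread, one (untested) way is to hand it the materialized transpose, since factoring BB^T with B = A^T gives A^TA:

At = sparse(A')      # materialize Aᵀ as a SparseMatrixCSC
Fg = cholgram(At)    # factors Aᵀ(Aᵀ)ᵀ == AᵀA without ever forming the product
x  = Fg \ y          # solves (AᵀA) x = y for some right-hand side y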

It would be nice to support this as a keyword argument directly in Julia’s standard library. Essentially, all I did was to disable these two lines and be sure to pass 0 for stype in the Sparse constructor, so it would be an easy PR to add a keyword argument for that if someone wants to take it up.

5 Likes

This is lovely, thank you! I will check out the solution once I have access to a laptop and can potentially do a simple PR for it :slight_smile:

Thanks again, @stevengj!

1 Like

So, I tested all of the solutions, and the results are fairly interesting. In terms of timing, actually computing AA^T and taking the Cholesky factorization of that seems to perform roughly the same as using the asymmetric mode of CHOLMOD! Code for the comparison is below.

module TestThisStuff

using LinearAlgebra, SparseArrays, SuiteSparse
using SuiteSparse.CHOLMOD
using LightGraphs
using BenchmarkTools

# @stevengj 's code
function cholgram(A::Sparse)
    cm = CHOLMOD.defaults(CHOLMOD.common_struct[Threads.threadid()])
    F = CHOLMOD.analyze(A, cm)

    cholesky!(F, A)

    return F
end
cholgram(A::SparseMatrixCSC) = cholgram(Sparse(A, 0)) # Sparse(A, 0) implies nonsymmetric mode

# Test code
ground_idx = 1
grid_size = 1000

graph = LightGraphs.SimpleGraphs.grid([grid_size, grid_size])

A = [
    incidence_matrix(graph, oriented=true) spzeros(nv(graph), 1)
]
A[ground_idx, end] = 1

@show nnz(A)

@btime cholesky(A * A')
@btime cholgram(A)
@btime qr($(SparseMatrixCSC(A'))) # Necessary to prevent abstract fallback

end

With output:

3996001
  3.457 s (22 allocations: 211.33 MiB)
  3.675 s (2 allocations: 32 bytes)
  14.373 s (1000051 allocations: 1.75 GiB)

Regarding the comparison above: my guess is that, when comparing the two methods, the matrix is not being cast as Symmetric, which is what leads to the very slow solution times (an @which on the call gives an abstract fallback, which I think is the culprit).

Additionally, the QR factorization is still relatively slow; I haven’t yet compared the performance of the backsolves using QR in the inner loop, but the problem I’m actually working on is at a much larger scale, which suggests that a sparse QR factorization may not be a feasible solution.

Anyways, thanks to all who have contributed :slight_smile: !

The optimality conditions of

\min \ \tfrac{1}{2} \|Ax - b\|^2 + c^T x

can also be written

\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ c \end{bmatrix},

where r = b - Ax is the least-squares residual, and which can be solved using a sparse symmetric indefinite factorization. The system above is larger than the “normal equations” that you wrote, but it is typically better conditioned and likely to be sparser. Another advantage is that no Q needs to be stored or applied.

Because it is symmetric, you need only specify one triangle of the matrix. It can be factorized using HSL.jl (JuliaSmoothOptimizers/HSL.jl, a Julia interface to the HSL Mathematical Software Library) or MUMPS.jl (JuliaSmoothOptimizers/MUMPS.jl, a Julia interface to MUMPS).

Note that the HSL codes are typically only free for academia. MUMPS is open source and can use MPI if your systems are that large. If you have access to it, I find that HSL_MA57 is often the fastest. You can store the factorization and reuse it to solve for multiple right-hand sides.

If you’re also interested in regularizing your least-squares term in the sense of solving

\min \ \tfrac{1}{2} \|Ax - b\|^2 + \tfrac{1}{2} \|\lambda x\|^2 + c^T x,

the system becomes

\begin{bmatrix} I & A \\ A^T & -\lambda^2 I \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ c \end{bmatrix},

and, in addition to HSL.jl and MUMPS.jl, there’s a pure Julia solution: https://github.com/JuliaSmoothOptimizers/LDLFactorizations.jl.
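For concreteness, here is a rough sketch of the regularized variant using the standard library’s CHOLMOD-based ldlt (the same pattern should carry over to ldl from LDLFactorizations.jl or to HSL/MUMPS); A, b, and c are assumed given, and the value of λ is purely illustrative:

using SparseArrays, LinearAlgebra, SuiteSparse

m, n = size(A)
λ = 1e-4                                   # illustrative regularization parameter

# Symmetric quasi-definite augmented system; the residual r = b - A*x
# comes out in the first block and x in the second.
K = [sparse(1.0I, m, m)  A;
     sparse(A')          sparse(-λ^2 * I, n, n)]

LDL = ldlt(K)          # factor once (no Q to store); reuse for many right-hand sides
sol = LDL \ [b; c]
r, x = sol[1:m], sol[m+1:end]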

I’d be curious to know how they do on your problem.

1 Like