I am working with a CuSparseMatrixCSR{Float64, Int32} matrix A on the GPU to solve a linear system multiple times during the training of an ML model. I’m hoping to perform a sparse LU factorization (with optional reordering) of A and reuse it for efficient repeated solves, rather than factoring each time.
I noticed the availability of CUDA.CUSOLVER.SparseCholesky for Cholesky factorizations but couldn’t find an equivalent for LU. Is there a way to perform a sparse LU factorization in this context, or do you know of any workarounds or alternative approaches to achieving this? Any insights or library recommendations would be greatly appreciated.
What you need is CUDSS.jl.
I provide an example in the README that demonstrates how to reuse the analysis phase, which is the most expensive part because it is performed on the CPU (cuDSS relies on METIS for the fill-reducing ordering).
We use it in interior point methods (MadNLP.jl) to solve multiple KKT systems with the same sparsity patterns, which makes it very efficient!
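For reference, the pattern looks roughly like this (a minimal sketch; cudss_set and the "refactorization" phase are assumed to behave as in the CUDSS.jl README):

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

T = Float64
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
A_gpu = CuSparseMatrixCSR(A_cpu)
x_gpu = CUDA.zeros(T, n)
b_gpu = CuVector(rand(T, n))

solver = CudssSolver(A_gpu, "G", 'F')          # "G" = general matrix, 'F' = full storage
cudss("analysis", solver, x_gpu, b_gpu)        # symbolic analysis (METIS ordering, on the CPU)
cudss("factorization", solver, x_gpu, b_gpu)   # numeric LU on the GPU
cudss("solve", solver, x_gpu, b_gpu)

# New values, same sparsity pattern: the analysis is reused
B_gpu = CuSparseMatrixCSR(A_cpu + 2I)          # same pattern as A_cpu
cudss_set(solver, B_gpu)                       # assumed to update the stored matrix values
cudss("refactorization", solver, x_gpu, b_gpu)
cudss("solve", solver, x_gpu, b_gpu)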
Thanks, that’s exactly what I needed. I have one question regarding the example provided in the README. I don’t know in advance how many systems I will have to solve, and I don’t know the right-hand sides in advance either. If I make a second call to the solver, as done below:
Are all computations on the GPU?
Can I use the same x_gpu as before, or should I allocate a new one as I did below?
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
T = Float64
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
x_cpu = zeros(T, n)
b_cpu = rand(T, n)
A_gpu = CuSparseMatrixCSR(A_cpu)
x_gpu = CuVector(x_cpu)
b_gpu = CuVector(b_cpu)
solver = CudssSolver(A_gpu, "G", 'F')         # "G" = general matrix, 'F' = full storage
cudss("analysis", solver, x_gpu, b_gpu)       # symbolic analysis
cudss("factorization", solver, x_gpu, b_gpu)  # numeric LU factorization
cudss("solve", solver, x_gpu, b_gpu)
### SECOND CALL HERE
x_gpu_new = CUDA.zeros(T, n)      # allocate the new solution vector on the GPU, not the CPU
b_gpu_new = CuVector(rand(T, n))
cudss("solve", solver, x_gpu_new, b_gpu_new)
If you make a second call to the solver, as many operations as possible are performed on the GPU. cuDSS sometimes uses the CPU, but it does so asynchronously, just like the GPU, so the CPU is only used where it is faster.
You can reuse the same storage x_gpu and b_gpu if you want; there is no need to allocate new vectors.
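For instance, you can overwrite the old right-hand side in place (a small sketch; copyto! copies the new CPU values into the existing GPU buffer):

copyto!(b_gpu, rand(T, n))            # refresh the right-hand side in place
cudss("solve", solver, x_gpu, b_gpu)  # x_gpu is overwritten with the new solution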
You can also use this high-level API if you prefer:
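A minimal sketch of that interface, assuming the lu/lu! overloads and the backslash solve exported by CUDSS.jl:

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

T = Float64
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
A_gpu = CuSparseMatrixCSR(A_cpu)
b_gpu = CuVector(rand(T, n))

F = lu(A_gpu)      # analysis + numeric LU factorization via cuDSS
x_gpu = F \ b_gpu  # triangular solves on the GPU

# New values, same sparsity pattern: refactorize without redoing the analysis
B_gpu = CuSparseMatrixCSR(A_cpu + 2I)
lu!(F, B_gpu)
x_gpu = F \ b_gpu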
Thanks, the high-level API is very useful. Just one last question: is there any way to use the same LU factorization to solve systems "from the right", i.e. something like b / A instead of A \ b?
Yes, that’s correct: I want to solve systems with A', and so use the factorization in the form A' = U'D'L'. No need for operator overloading; a more verbose way to do it would be fine as well.
The cuDSS library doesn’t yet support solving linear systems with A'.
The option already exists in the API but is not functional (see CUDSS_CONFIG_SOLVE_MODE in cuDSS Data Types — NVIDIA cuDSS documentation).
It will probably be available in future releases.
If you have an example of an application where you need to solve systems with both A and A', I can send an email to the cuDSS team about it.
I have a loss function of the form \ell(x) = \frac{1}{2}\|A^{-1}u(x)\|_2^2, where u:\mathbb{R}^p \to \mathbb{R}^n is some function (in my case, a neural network, but that doesn’t matter much). Taking the gradient of the loss with respect to x gives \nabla\ell(x) = \mathrm{J}u(x)^{\top} A^{-\top}A^{-1} u(x), where \mathrm{J}u(x) is the Jacobian of u. This is why I need to solve systems with A^\top as well.
Is the product AA^\top too expensive to compute, or too dense, in your case?
Otherwise, you can use a Cholesky factorization of this product.
Solving a system with AA^\top is equivalent to applying A^{-\top} A^{-1} (if A is square and nonsingular), which is exactly the operator that appears in the gradient of your loss function.
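A rough sketch of that workaround (the product H = AA^\top is formed on the CPU here, and cholesky is assumed to dispatch to the cuDSS Cholesky provided by CUDSS.jl):

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

T = Float64
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
H_cpu = A_cpu * A_cpu'         # H = A * A', symmetric positive definite if A is nonsingular
H_gpu = CuSparseMatrixCSR(H_cpu)

u_gpu = CuVector(rand(T, n))   # stand-in for u(x)

F = cholesky(H_gpu)   # cuDSS Cholesky of H (full matrix stored, 'F' view assumed by default)
y_gpu = F \ u_gpu     # y = inv(A * A') * u = inv(A)' * inv(A) * u

Note that H can be much denser than A because of fill-in in the product, which is why the question above matters.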