I am working with a CuSparseMatrixCSR{Float64, Int32} matrix A on the GPU to solve a linear system multiple times during the training of an ML model. I’m hoping to perform a sparse LU factorization (with optional reordering) of A and reuse it for efficient repeated solves, rather than factoring each time.

I noticed the availability of CUDA.CUSOLVER.SparseCholesky for Cholesky factorizations but couldn’t find an equivalent for LU. Is there a way to perform a sparse LU factorization in this context, or do you know of any workarounds or alternative approaches to achieving this? Any insights or library recommendations would be greatly appreciated.

What you need is CUDSS.jl.
I provide an example in the README to demonstrate how to reuse the analysis phase, which is the most expensive part as it is performed on the CPU (cuDSS relies on METIS).

We use it in interior point methods (MadNLP.jl) to solve multiple KKT systems with the same sparsity patterns, which makes it very efficient!

Thanks, that’s what I really needed. I have one question regarding the example provided in the readme. I don’t know in advance how many systems I will solve, and I also don’t know the right-hand-sides in advance. If I do a second call to the solver, as done below:

Are all computations on the GPU?

Can I use the same x_gpu as before? Should I allocate a new one as I did or not?

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
T = Float64
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
x_cpu = zeros(T, n)
b_cpu = rand(T, n)
A_gpu = CuSparseMatrixCSR(A_cpu)
x_gpu = CuVector(x_cpu)
b_gpu = CuVector(b_cpu)
solver = CudssSolver(A_gpu, "G", 'F')
cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu)
cudss("solve", solver, x_gpu, b_gpu)
### SECOND CALL HERE
x_gpu_new = zeros(T, n)
b_gpu_new = CuVector(rand(T, n))
cudss("solve", solver, x_gpu_new, b_gpu_new)

If you make a second call to the solver, as many operations as possible are performed on the GPU. CUDSS sometimes uses the CPU, but in the same way as the GPU in an asynchronous manner, so it’s more about being faster than anything else.

You can reuse the same storage x_gpu and b_gpu if you want

You can also use this high-level API if you prefer:

Thanks, the high-level API is very useful. Just one last question: any way to use the same LU factorization to solve systems “from the right”, i.e. something like b/A instead of A\b?

Yes, that’s correct, I want to solve systems with A', and so use a factorization of the form A'=U'D'L'. No need of overloading, if there is a more verbose way to do it is fine as well.

The library cuDSS doesn’t support yet the option to solve linear systems with A'.
The option is already available in the API but not functional (CUDSS_CONFIG_SOLVE_MODE in cuDSS Data Types — NVIDIA cuDSS documentation).

It will probably be available in future releases.
If you have an example of application where you need to solve systems with both A and A', I can send an email to the cuDSS team about that.

I have a loss function of the form \ell(x) = \frac{1}{2}\|A^{-1}u(x)\|_2^2, where u:\mathbb{R}^p \to \mathbb{R}^n is some function (in my case, a neural network, but it doesn’t matter much). Then if I want to take the gradient with respect to x of my loss function, it will be \nabla\ell(x) = \mathrm{J}u(x)^{\top} A^{-\top}A^{-1} u(x). This is why I need to solve systems with A^\top as well.

Is the product AA^T too expensive to compute or dense in your case?
Otherwise, you can use cholesky on this product.
Solving a system with AA^T is equivalent to do a product with A^{-T} A^{-1} (if A is square and nonsingular), which is what you need in the gradient of your loss function.