Solving Sparse Linear Systems fast

I want to solve Ax = b where A is a sparse matrix of size 10^5 x 10^5 (around 18 non-zero entries per row), x and b are vectors. A is constant throughout my simulation, while b changes at every step in my loop, and I have recalculate x once b changes.

I use LU decomposition for A and pass this to ldiv!. It takes around 0.05 seconds for the ldiv! step. Strangely (or maybe I am missing something) when I change the values in BLAS.set_num_threads I see no difference in the timings reported by @time macro.

I want to parallelize this in hopes of getting a faster solve time. Using CUDA.jl I am unable to use the backslash operator as it does not work for SparseMatrixCSR. The only thing I found was CUSOLVER.csrlsvqr! which takes a staggering 4seconds, which I profiled and noticed that 49% of time is spent on “void csqr_leftLooking” and 47% on “CUDA memcpy HtoD”. Please note that I had already copied (A,x,b) to GPU via CuArray constructs and fed these to CUSOLVER.csrlsvqr!, so the HtoD memcpy is confusing to me.

My question is thus:
How do I best solve a sparse linear system (on a CPU, or a GPU, whichever is faster for sizes upwards of 10^5 x 10^5) where the factors must be computed ONLY ONCE.

Thanks in advance!

1 Like

Hi, welcome @pushkar_khandare
For // CPU, you may give a try to:

Did not try myself but you may also use
on GPUs.

I would be interested by your experiments :wink:

You can run the whole gambit of methods, along with the documented preconditioners. Whether KrylovJL_GMRES(), KLUFactorization(), UMFPACKFactorization(), the GPU default, MKLPardisoIterate() are best is dependent on the sparsity pattern, so it’s hard to guess from this information.

I believe that UMFPACK (facto and solve) is not parallel though.

It uses the BLAS3 kernels, so it’ll multithread if you have the right sparsity pattern.

1 Like

Thank you for your reply!
Unfortunately when I applied for an academic license to use Pardiso I got an email stating that academic licences are not being given out starting January 2022.
Haven’t been able to use Pardiso.jl yet.

1 Like

Hello Chris,
Thank you for your help!
I am attaching the sparsity pattern of my matrix

This is for a smaller matrix. The first few and last few rows have diagonal entries only, and the rest come from the fact that I am solving a FEM problem on a square mesh.
I do not know how to see this sparsity pattern and find an efficient preconditioner, or even make an educated guess of which methods might work.
Any pointers are appreciated! Meanwhile I will go through all solvers from the link you shared.
The matrices in my code will be of the order 10^5 x 10^5, with the sparsity pattern above.

You don’t need it for MKLPardiso.

MWE with LinearSolve.jl:

using LinearSolve, LinearSolvePardiso, SparseArrays, Random

A = sparse([1.0 0 -2 3
    0 5 1 2
    -2 1 4 -7
    3 2 -7 5])
b = rand(4)
prob = LinearProblem(A, b)
u = solve(prob, MKLPardisoFactorize(); cache_kwargs...).u

should work and automatically install any binaries needed.

This has a regular sparsity pattern, so it should be a case where UMFPACK-type algorithms do well because they should be able to exploit the parallelism of the BLAS3 kernel. I’d try:

using LinearSolve, LinearSolvePardiso, SparseArrays
A = # your matrix
b = # test b

prob = LinearProblem(A, b)

for alg in (

    u = solve(prob, alg).u

using AlgebraicMultigrid
ml = ruge_stuben(A) # Construct a Ruge-Stuben solver
pl = aspreconditioner(ml)
solve(prob1, KrylovJL_GMRES(), Pl = pl).u

using IncompleteLU
pl = ilu(A, τ = 0.1) # τ needs to be tuned per problem
solve(prob1, KrylovJL_GMRES(), Pl = pl).u

I should make that into a tutorial. You can see this in action in Solving Large Stiff Equations · DifferentialEquations.jl


Thanks for such a comprehensive answer Chris!
I will go through the steps mentioned and post what I find.

1 Like

Hello Chris,

After running through the examples you gave, and following the excellent caching interface of LinearSolve.jl, I found out that KLU factorisation gives me the faster solve times than UMFPACK (though the factorisation times are quite high for KLU, but since its a one-time operation it does not matter in my case).
I have a key question. When I change the BLAS.set_num_threads value I see no difference in solve times. How can I vary the number of threads that these algorithms use, and choose the optimal number of threads? I want to run my code on my institute cluster, and want to use all resources I can to solve Ax=b in the fastest time possible.

1 Like

I don’t see a method to change the number of threads used for KLU in the User Guide. I’ll investigate that further, although it may be a compile time option. Off the top of my head I’m not sure which parts are parallelized.

Have you tried setting OMP_NUM_THREADS environment variable?