I want to solve **Ax = b** where A is a sparse matrix of size 10^5 x 10^5 (around 18 non-zero entries per row), **x** and **b** are vectors. **A** is constant throughout my simulation, while **b** changes at every step in my loop, and I have recalculate **x** once **b** changes.

I use LU decomposition for **A** and pass this to ldiv!. It takes around 0.05 seconds for the ldiv! step. Strangely (or maybe I am missing something) when I change the values in BLAS.set_num_threads I see no difference in the timings reported by @time macro.

I want to parallelize this in hopes of getting a faster solve time. Using CUDA.jl I am unable to use the backslash operator as it does not work for SparseMatrixCSR. The only thing I found was CUSOLVER.csrlsvqr! which takes a staggering 4seconds, which I profiled and noticed that 49% of time is spent on â€śvoid csqr_leftLookingâ€ť and 47% on â€śCUDA memcpy HtoDâ€ť. Please note that I had already copied **(A,x,b)** to GPU via CuArray constructs and fed these to CUSOLVER.csrlsvqr!, so the HtoD memcpy is confusing to me.

My question is thus:

How do I best solve a sparse linear system (on a CPU, or a GPU, whichever is faster for sizes upwards of 10^5 x 10^5) where the factors must be computed ONLY ONCE.

Thanks in advance!

1 Like

Hi, welcome @pushkar_khandare

For // CPU, you may give a try to:

Did not try myself but you may also use

http://juliasmoothoptimizers.github.io/Krylov.jl/dev/gpu/

on GPUs.

I would be interested by your experiments

http://linearsolve.sciml.ai/dev/

You can run the whole gambit of methods, along with the documented preconditioners. Whether `KrylovJL_GMRES()`

, `KLUFactorization()`

, `UMFPACKFactorization()`

, the GPU default, `MKLPardisoIterate()`

are best is dependent on the sparsity pattern, so itâ€™s hard to guess from this information.

I believe that UMFPACK (facto and solve) is not parallel though.

It uses the BLAS3 kernels, so itâ€™ll multithread if you have the right sparsity pattern.

1 Like

Thank you for your reply!

Unfortunately when I applied for an academic license to use Pardiso I got an email stating that academic licences are not being given out starting January 2022.

Havenâ€™t been able to use Pardiso.jl yet.

1 Like

Hello Chris,

Thank you for your help!

I am attaching the sparsity pattern of my matrix

This is for a smaller matrix. The first few and last few rows have diagonal entries only, and the rest come from the fact that I am solving a FEM problem on a square mesh.

I do not know how to see this sparsity pattern and find an efficient preconditioner, or even make an educated guess of which methods might work.

Any pointers are appreciated! Meanwhile I will go through all solvers from the link you shared.

The matrices in my code will be of the order 10^5 x 10^5, with the sparsity pattern above.

You donâ€™t need it for MKLPardiso.

MWE with LinearSolve.jl:

```
using LinearSolve, LinearSolvePardiso, SparseArrays, Random
A = sparse([1.0 0 -2 3
0 5 1 2
-2 1 4 -7
3 2 -7 5])
b = rand(4)
prob = LinearProblem(A, b)
u = solve(prob, MKLPardisoFactorize(); cache_kwargs...).u
```

should work and automatically install any binaries needed.

This has a regular sparsity pattern, so it should be a case where UMFPACK-type algorithms do well because they should be able to exploit the parallelism of the BLAS3 kernel. Iâ€™d try:

```
using LinearSolve, LinearSolvePardiso, SparseArrays
A = # your matrix
b = # test b
prob = LinearProblem(A, b)
for alg in (
MKLPardisoFactorize(),
MKLPardisoIterate(),
UMFPACKFactorization(),
KLUFactorization())
u = solve(prob, alg).u
end
using AlgebraicMultigrid
ml = ruge_stuben(A) # Construct a Ruge-Stuben solver
pl = aspreconditioner(ml)
solve(prob1, KrylovJL_GMRES(), Pl = pl).u
using IncompleteLU
pl = ilu(A, Ď„ = 0.1) # Ď„ needs to be tuned per problem
solve(prob1, KrylovJL_GMRES(), Pl = pl).u
```

I should make that into a tutorial. You can see this in action in Solving Large Stiff Equations Â· DifferentialEquations.jl

6 Likes

Thanks for such a comprehensive answer Chris!

I will go through the steps mentioned and post what I find.

1 Like

Hello Chris,

After running through the examples you gave, and following the excellent caching interface of LinearSolve.jl, I found out that KLU factorisation gives me the faster solve times than UMFPACK (though the factorisation times are quite high for KLU, but since its a one-time operation it does not matter in my case).

I have a key question. When I change the BLAS.set_num_threads value I see no difference in solve times. How can I vary the number of threads that these algorithms use, and choose the optimal number of threads? I want to run my code on my institute cluster, and want to use all resources I can to solve Ax=b in the fastest time possible.

1 Like

I donâ€™t see a method to change the number of threads used for KLU in the User Guide. Iâ€™ll investigate that further, although it may be a compile time option. Off the top of my head Iâ€™m not sure which parts are parallelized.

Have you tried setting `OMP_NUM_THREADS`

environment variable?