I want to solve Ax = b where A is a sparse matrix of size 10^5 x 10^5 (around 18 non-zero entries per row), x and b are vectors. A is constant throughout my simulation, while b changes at every step in my loop, and I have recalculate x once b changes.
I use LU decomposition for A and pass this to ldiv!. It takes around 0.05 seconds for the ldiv! step. Strangely (or maybe I am missing something) when I change the values in BLAS.set_num_threads I see no difference in the timings reported by @time macro.
I want to parallelize this in hopes of getting a faster solve time. Using CUDA.jl I am unable to use the backslash operator as it does not work for SparseMatrixCSR. The only thing I found was CUSOLVER.csrlsvqr! which takes a staggering 4seconds, which I profiled and noticed that 49% of time is spent on “void csqr_leftLooking” and 47% on “CUDA memcpy HtoD”. Please note that I had already copied (A,x,b) to GPU via CuArray constructs and fed these to CUSOLVER.csrlsvqr!, so the HtoD memcpy is confusing to me.
My question is thus:
How do I best solve a sparse linear system (on a CPU, or a GPU, whichever is faster for sizes upwards of 10^5 x 10^5) where the factors must be computed ONLY ONCE.
Thanks in advance!
Hi, welcome @pushkar_khandare
For // CPU, you may give a try to:
Did not try myself but you may also use
http://juliasmoothoptimizers.github.io/Krylov.jl/dev/gpu/
on GPUs.
I would be interested by your experiments 
http://linearsolve.sciml.ai/dev/
You can run the whole gambit of methods, along with the documented preconditioners. Whether KrylovJL_GMRES(), KLUFactorization(), UMFPACKFactorization(), the GPU default, MKLPardisoIterate() are best is dependent on the sparsity pattern, so it’s hard to guess from this information.
I believe that UMFPACK (facto and solve) is not parallel though.
It uses the BLAS3 kernels, so it’ll multithread if you have the right sparsity pattern.
Thank you for your reply!
Unfortunately when I applied for an academic license to use Pardiso I got an email stating that academic licences are not being given out starting January 2022.
Haven’t been able to use Pardiso.jl yet.
Hello Chris,
Thank you for your help!
I am attaching the sparsity pattern of my matrix
This is for a smaller matrix. The first few and last few rows have diagonal entries only, and the rest come from the fact that I am solving a FEM problem on a square mesh.
I do not know how to see this sparsity pattern and find an efficient preconditioner, or even make an educated guess of which methods might work.
Any pointers are appreciated! Meanwhile I will go through all solvers from the link you shared.
The matrices in my code will be of the order 10^5 x 10^5, with the sparsity pattern above.
You don’t need it for MKLPardiso.
MWE with LinearSolve.jl:
using LinearSolve, LinearSolvePardiso, SparseArrays, Random
A = sparse([1.0 0 -2 3
0 5 1 2
-2 1 4 -7
3 2 -7 5])
b = rand(4)
prob = LinearProblem(A, b)
u = solve(prob, MKLPardisoFactorize(); cache_kwargs...).u
should work and automatically install any binaries needed.
This has a regular sparsity pattern, so it should be a case where UMFPACK-type algorithms do well because they should be able to exploit the parallelism of the BLAS3 kernel. I’d try:
using LinearSolve, LinearSolvePardiso, SparseArrays
A = # your matrix
b = # test b
prob = LinearProblem(A, b)
for alg in (
MKLPardisoFactorize(),
MKLPardisoIterate(),
UMFPACKFactorization(),
KLUFactorization())
u = solve(prob, alg).u
end
using AlgebraicMultigrid
ml = ruge_stuben(A) # Construct a Ruge-Stuben solver
pl = aspreconditioner(ml)
solve(prob1, KrylovJL_GMRES(), Pl = pl).u
using IncompleteLU
pl = ilu(A, Ď„ = 0.1) # Ď„ needs to be tuned per problem
solve(prob1, KrylovJL_GMRES(), Pl = pl).u
I should make that into a tutorial. You can see this in action in https://diffeq.sciml.ai/stable/tutorials/advanced_ode_example/
Thanks for such a comprehensive answer Chris!
I will go through the steps mentioned and post what I find.
Hello Chris,
After running through the examples you gave, and following the excellent caching interface of LinearSolve.jl, I found out that KLU factorisation gives me the faster solve times than UMFPACK (though the factorisation times are quite high for KLU, but since its a one-time operation it does not matter in my case).
I have a key question. When I change the BLAS.set_num_threads value I see no difference in solve times. How can I vary the number of threads that these algorithms use, and choose the optimal number of threads? I want to run my code on my institute cluster, and want to use all resources I can to solve Ax=b in the fastest time possible.
I don’t see a method to change the number of threads used for KLU in the User Guide. I’ll investigate that further, although it may be a compile time option. Off the top of my head I’m not sure which parts are parallelized.
Have you tried setting OMP_NUM_THREADS environment variable?