I want to solve Ax = b where A is a sparse matrix of size 10^5 x 10^5 (around 18 non-zero entries per row), x and b are vectors. A is constant throughout my simulation, while b changes at every step in my loop, and I have recalculate x once b changes.
I use LU decomposition for A and pass this to ldiv!. It takes around 0.05 seconds for the ldiv! step. Strangely (or maybe I am missing something) when I change the values in BLAS.set_num_threads I see no difference in the timings reported by @time macro.
I want to parallelize this in hopes of getting a faster solve time. Using CUDA.jl I am unable to use the backslash operator as it does not work for SparseMatrixCSR. The only thing I found was CUSOLVER.csrlsvqr! which takes a staggering 4seconds, which I profiled and noticed that 49% of time is spent on “void csqr_leftLooking” and 47% on “CUDA memcpy HtoD”. Please note that I had already copied (A,x,b) to GPU via CuArray constructs and fed these to CUSOLVER.csrlsvqr!, so the HtoD memcpy is confusing to me.
My question is thus:
How do I best solve a sparse linear system (on a CPU, or a GPU, whichever is faster for sizes upwards of 10^5 x 10^5) where the factors must be computed ONLY ONCE.
Thanks in advance!
1 Like
Hi, welcome @pushkar_khandare
For // CPU, you may give a try to:
Did not try myself but you may also use
http://juliasmoothoptimizers.github.io/Krylov.jl/dev/gpu/
on GPUs.
I would be interested by your experiments
http://linearsolve.sciml.ai/dev/
You can run the whole gambit of methods, along with the documented preconditioners. Whether KrylovJL_GMRES()
, KLUFactorization()
, UMFPACKFactorization()
, the GPU default, MKLPardisoIterate()
are best is dependent on the sparsity pattern, so it’s hard to guess from this information.
I believe that UMFPACK (facto and solve) is not parallel though.
It uses the BLAS3 kernels, so it’ll multithread if you have the right sparsity pattern.
1 Like
Thank you for your reply!
Unfortunately when I applied for an academic license to use Pardiso I got an email stating that academic licences are not being given out starting January 2022.
Haven’t been able to use Pardiso.jl yet.
1 Like
Hello Chris,
Thank you for your help!
I am attaching the sparsity pattern of my matrix
This is for a smaller matrix. The first few and last few rows have diagonal entries only, and the rest come from the fact that I am solving a FEM problem on a square mesh.
I do not know how to see this sparsity pattern and find an efficient preconditioner, or even make an educated guess of which methods might work.
Any pointers are appreciated! Meanwhile I will go through all solvers from the link you shared.
The matrices in my code will be of the order 10^5 x 10^5, with the sparsity pattern above.
You don’t need it for MKLPardiso.
MWE with LinearSolve.jl:
using LinearSolve, LinearSolvePardiso, SparseArrays, Random
A = sparse([1.0 0 -2 3
0 5 1 2
-2 1 4 -7
3 2 -7 5])
b = rand(4)
prob = LinearProblem(A, b)
u = solve(prob, MKLPardisoFactorize(); cache_kwargs...).u
should work and automatically install any binaries needed.
This has a regular sparsity pattern, so it should be a case where UMFPACK-type algorithms do well because they should be able to exploit the parallelism of the BLAS3 kernel. I’d try:
using LinearSolve, LinearSolvePardiso, SparseArrays
A = # your matrix
b = # test b
prob = LinearProblem(A, b)
for alg in (
MKLPardisoFactorize(),
MKLPardisoIterate(),
UMFPACKFactorization(),
KLUFactorization())
u = solve(prob, alg).u
end
using AlgebraicMultigrid
ml = ruge_stuben(A) # Construct a Ruge-Stuben solver
pl = aspreconditioner(ml)
solve(prob1, KrylovJL_GMRES(), Pl = pl).u
using IncompleteLU
pl = ilu(A, Ď„ = 0.1) # Ď„ needs to be tuned per problem
solve(prob1, KrylovJL_GMRES(), Pl = pl).u
I should make that into a tutorial. You can see this in action in https://diffeq.sciml.ai/stable/tutorials/advanced_ode_example/
6 Likes
Thanks for such a comprehensive answer Chris!
I will go through the steps mentioned and post what I find.
1 Like
Hello Chris,
After running through the examples you gave, and following the excellent caching interface of LinearSolve.jl, I found out that KLU factorisation gives me the faster solve times than UMFPACK (though the factorisation times are quite high for KLU, but since its a one-time operation it does not matter in my case).
I have a key question. When I change the BLAS.set_num_threads value I see no difference in solve times. How can I vary the number of threads that these algorithms use, and choose the optimal number of threads? I want to run my code on my institute cluster, and want to use all resources I can to solve Ax=b in the fastest time possible.
1 Like
I don’t see a method to change the number of threads used for KLU in the User Guide. I’ll investigate that further, although it may be a compile time option. Off the top of my head I’m not sure which parts are parallelized.
Have you tried setting OMP_NUM_THREADS
environment variable?