MPI/Distributed linear system solver?

Hi, folks.
Is there a well established way to solve a linear system (normal, fully populated, Ax=b) across several nodes with Julia? Using MPI or something.
I looked around and couldn’t find anything.
Thanks!

There is MUMPS.jl if you want to use a sparse-direct solve, and Elemental.jl for dense direct solves. For distributed-memory sparse iterative solves, you could use PartitionedArrays.jl or DistributedArrays.jl with one of the iterative-solver packages (see e.g. this talk from JuliaCon 2022).

2 Likes

Thanks for the info.
The matrix A is not sparse (neither is it symmetric, all positive, etc.). Think A=rand(1000,1000); b=rand(1000);.
My main problem is not distributing A in memory (it fits in my RAM), but the actual computation of x, which can take ~10 seconds and is something I must do ~1000 times. So what I need from a distributed-memory solver is speed, not capacity. So I don’t think using a distributed array with the linear-solver packages (KrylovKit, IterativeSolvers) would work. Please correct me if I’m wrong.
I also don’t see a method to do this with Elemental, but that might be due to my own limitations in interpreting the source code, as there’s no documentation on linear solvers and no comments in the code.
Thanks again!

10 s per solve for 1000 solves is only about 3 hours of runtime. So you’d reach the finish line faster by just running the simple code you have now than by spending the time to come up with something faster :wink:

Apart from that, I don’t think you can gain orders of magnitude in speed by using better libraries. If your solves are independent of each other, then you can use trivial parallelization via threads, or employ multiple nodes in a cluster, to decrease the overall runtime.
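
For the independent case, a minimal sketch with Julia threads might look like this (made-up sizes and matrices; note that you may want to reduce BLAS threading to avoid oversubscribing cores):

using LinearAlgebra
BLAS.set_num_threads(1)   # avoid oversubscription when combining BLAS and Julia threads
As = [rand(1000, 1000) + 1000I for _ in 1:8]   # hypothetical independent systems
bs = [rand(1000) for _ in 1:8]
xs = Vector{Vector{Float64}}(undef, length(As))
Threads.@threads for i in eachindex(As)
    xs[i] = As[i] \ bs[i]                      # each solve runs on its own thread
end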

This is dense linear algebra, à la Elemental.jl for the distributed-memory case, which has a solve! function to solve linear equations, I believe. (Though the documentation is sparse, no pun intended.)

It’s hard to give you more specific advice without knowing more information. How big are your matrices? If they fit in memory, have you tried just enabling multithreading? Do you have many right-hand-sides with the same matrix (in which case you can re-use the factorization), or all different matrices? Are the different solves independent, so that they can be done in parallel?
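
(For reference, reusing a factorization across many right-hand sides with the same matrix looks something like this, with made-up sizes:)

using LinearAlgebra
A = rand(2000, 2000) + 2000I   # hypothetical fixed matrix
B = rand(2000, 50)             # 50 right-hand sides
F = lu(A)                      # factor once: O(n³)
X = F \ B                      # each subsequent solve reuses F: O(n²) per column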

1 Like

@stevengj The matrix is between 1000x1000 and 50000x50000.
I do use a multithreaded solver (IterativeSolvers.gmres!).
It’s a time-marching scheme, so the linear systems are sequential.
The matrix changes every time. Sometimes, it changes very little. The right hand side also changes. Also sometimes very little. What I do is I initialize x with the previous solution, but maybe I could be doing something better?

I think I got Elemental.solve! to work with MPI on a single machine. Sort of. I have to figure some details out. I’ll try on multiple machines and update this. Thanks!

@abraemer 3 hours that, if I could bring it down to, say, 20 minutes by using a few machines, would help me a bit =]

1 Like

In general, I wouldn’t use gmres for dense matrices. It will usually be faster to use a direct factorization.

(Also, IterativeSolvers.gmres! is only multi-threaded insofar as the matrix×vector operations are multi-threaded, I think.)

In this case, one option to try is to do a dense LU factorization of the matrix, and then use this as a preconditioner for GMRES — if the matrix has changed very little, this should cause GMRES to converge extremely quickly. Every once in a while (once the matrix has changed significantly, as evidenced by GMRES becoming slow — taking > 30 iterations, say), you can re-do the LU factorization.
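
Roughly, the loop could look like the sketch below. This is a toy stand-in for the real time-marching problem: the per-step perturbation, the 30-iteration staleness threshold, and the assumption that gmres!'s log=true return provides an iteration count in hist.iters are all illustrative.

using LinearAlgebra, IterativeSolvers

function march!(x, A, b; nsteps = 100, stale = 30)
    F = lu(A)                              # dense LU of the current matrix
    for step in 1:nsteps
        A .+= 1e-6 .* rand(size(A)...)     # stand-in for the small per-step change
        b .= rand(length(b))               # stand-in for the new right-hand side
        _, hist = gmres!(x, A, b; Pl = F, log = true)
        if hist.iters > stale              # preconditioner has gone stale
            F = lu(A)                      # re-factorize and keep going
        end
    end
    return x
end

n = 2000
x = march!(zeros(n), rand(n, n) + n * I, rand(n))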

In general, I would focus on optimizing the linear algebra before you worry about parallelizing.

(1000x1000 is almost certainly too small to benefit from distributed memory. You’re better off with threads on a multi-core machine. You can try MKL.jl to see if it’s a better multi-threaded solve on your machine than OpenBLAS. Don’t forget to specify the number of threads you want to use with the appropriate environment variable.)
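
(Something like this, assuming Julia ≥ 1.7, where MKL.jl swaps the BLAS backend via libblastrampoline; the thread count of 8 is just an example:)

using MKL                  # optional: swap the default BLAS from OpenBLAS to MKL
using LinearAlgebra
BLAS.set_num_threads(8)    # or set OPENBLAS_NUM_THREADS / MKL_NUM_THREADS before launching Julia
BLAS.get_config()          # check which BLAS library is actually in use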

3 Likes

@stevengj When the problem is small, I use x=A\b, which works great. For bigger problems, that does not work. I believe this is doing LU factorization (doing a manual LU takes about the same time). Doing LU for large matrices becomes quite expensive and also forces me to store another huge matrix in memory. So I’m not sure it’s worth the effort, but maybe I’ll try it. It seems IterativeSolvers.gmres! has options for a left and a right preconditioner. I think the standard would be to use L+I as the left preconditioner and not use a right preconditioner at all, right? Is this what you’re suggesting?

I also tested other options. Using DistributedArrays along with \ or IterativeSolvers.gmres! straight up does not work, due to methods being missing. Using it with KrylovKit.linsolve failed to reach convergence, for reasons I do not understand (it worked with regular arrays).

In the meantime, I’m also trying to get Elemental to work on multiple nodes, which I’m hoping I’ll manage to do soon.
Thanks!

1 Like

You might consider doing the LU in lower precision and using iterative refinement. The unregistered package at

MultiPrecisionArrays.jl

might help you. However, it’s at a very, very early stage of development and does not explicitly support distributed memory. Using the lower-precision LU as a preconditioner is also a sensible idea. The documentation that comes with the package talks about this a bit.

There is no reason that L+I would help. Using LU would be a lot better.

1 Like

No.

Suppose that you have matrices A = A_0 + \delta A, for a set of small changes \delta A, and you want to solve Ax = b. First, you compute the LU factorization of A_0, i.e. F = lu(A₀). Then, you solve Ax = b with y -> F \ y as a left (or right) preconditioner.

Equivalently, with a left preconditioner you solve A_0^{-1} A x = A_0^{-1} b, where the right-hand side is computed by F \ b, and the linear operator on the left-hand side is computed by x -> F \ (A * x) (or some in-place variant thereof — note that the parentheses here are important!). If \delta A is small, then A_0^{-1} A should be very well conditioned (close to I, in fact), so any iterative method (e.g. GMRES) should converge extremely quickly.

As @ctkelley points out, you can even use e.g. F = lu(Float32.(A₀)), i.e. compute your dense-LU preconditioner in a lower precision to speed it up further.

2 Likes

Sorry, I’m struggling to do what you guys are proposing.

using IterativeSolvers
using LinearAlgebra
N = 100;
A = rand(N,N);
b = rand(N);
x = A\b;

L,U=lu(A);
x1 = ones(N);
gmres!(x1,A,b;Pl=(L*U)\(A*x1));

The last line fails (ERROR: MethodError: no method matching ldiv!(::Vector{Float64}, ::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})), so I’m guessing I’m still misunderstanding things.
Thanks for the patience!

Try

AF=lu(A)
Pl=AF\(A*x1)

The output of lu is a factorization object, so it’s best to use that object directly…

Just tested it. That did not work either. I’ve never used the gmres code from IterativeSolvers and clearly don’t understand it. Here’s iterative refinement.

julia> AF=lu(Float32.(A));
julia> x1=zeros(N);
julia> for it=1:5
       r = b - A*x1;
       d = AF\r;
       x1 += d;
       println(norm(x1-x,Inf))
       end
8.97273e-06
2.17159e-11
7.10543e-15
1.06581e-14
9.54792e-15

2 Likes

using IterativeSolvers
using LinearAlgebra
N = 100;
A = rand(N,N);
b = rand(N);
x = A\b;
x1 = ones(N);
AF=lu(A);
Pl=AF\(A*x1);
gmres!(x1,A,b;Pl=Pl);
ERROR: MethodError: no method matching ldiv!(::Vector{Float64}, ::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})

Same issue. Am I misunderstanding something?

Sure, that converges very fast, but that requires \. Using gmres! saves me a lot of time, so I’m trying to avoid \.

F \ x is roughly as fast as matrix multiplication if F is a factorization object.
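
For example (a toy check; timings are machine-dependent, and BenchmarkTools is only needed for the timing itself):

using LinearAlgebra, BenchmarkTools
n = 5000
A = rand(n, n); b = rand(n); F = lu(A)
@btime $A * $b;    # one O(n²) matrix-vector product
@btime $F \ $b;    # permutation plus two triangular solves, the same order of cost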

1 Like

No, this is wrong. A preconditioner is something you apply at every step — a linear operator — not something you do to a single vector at the beginning.

Try:

using IterativeSolvers, LinearAlgebra
n = 100
A₀ = randn(n, n); b = randn(n);
F₀ = lu(A₀);
A = A₀ + rand(n,n) * 1e-3; # a small change to A₀
x = zeros(n); # initial "guess"
gmres!(x, A, b, Pl = F₀, verbose=true)

which converges extremely quickly, versus gmres!(x, A, b, verbose=true) (no preconditioner), which in this example doesn’t converge at all with the default restart — you would need to increase the restart parameter!

As explained in the preconditioning chapter of the manual, the keyword Pl is something that you do \ with, i.e. it is the factorization object itself.

(Personally, I usually use KrylovKit.jl. In this case you have to implement preconditioning yourself by changing the problem as I explained in my post above. However, I think you need a better understanding of what preconditioning is first.)
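
In case it helps later, here is a rough sketch of that change of problem with KrylovKit.linsolve, which accepts a plain function as the linear operator, using the same toy setup as the gmres! example above:

using KrylovKit, LinearAlgebra
n = 100
A₀ = randn(n, n); b = randn(n)
A = A₀ + 1e-3 * rand(n, n)       # small perturbation of A₀
F₀ = lu(A₀)                      # factorization of the unperturbed matrix
op = x -> F₀ \ (A * x)           # left-preconditioned operator A₀⁻¹ A
x, info = linsolve(op, F₀ \ b)   # solve A₀⁻¹ A x = A₀⁻¹ b
@show info.converged norm(A * x - b)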

4 Likes

Thank you very much! Indeed, I know nothing of preconditioning, as you can tell.
This approach led to some interesting results. I took a real case that I’m running where each gmres! call takes about 9 seconds. Then I computed the LU factorization (30 sec) and stored it. Using this LU as a preconditioner lets gmres! solve in 1.3 seconds. Then I advanced one or two timesteps and ran gmres! again, without updating the LU. It took gmres! about 3.3 seconds for each of those timesteps. So that’s a 3x speedup, which is very good! Using the previous x as the initial guess does not seem to speed things up at all.
I need to test it further, advancing more timesteps to see how often I’d have to update LU (hopefully never), but so far this is quite promising!

1 Like

You might try iterative refinement (IR). I have found that if the LU is close to A, then IR is faster than GMRES with the LU as a preconditioner. One reason for this is that the management of the Krylov vectors in GMRES goes away with IR.

The small loop I posted is all you should need.

2 Likes

For the case I mentioned above (LU computed 2 timesteps previously), your method and gmres take about the same time for similar accuracy. I’ll still need to see what happens after a few more timesteps.
Thanks!

Here is a link to a post providing a justification and pseudo-code for a complete implementation of the iterative refinement technique, essentially the same as what @ctkelley suggested.

2 Likes