Hello,
I’m writing some code that solves A*x=b, where A can be quite big (say 10000x10000, hopefully bigger in the future). A is not sparse (at least for now) and might be poorly conditioned. It might even have multiple solutions.

Using A\b is not working great so far, giving me results with fairly large errors (an element of A\b can be 3 times the corresponding element of the true x). Directly inverting A (x = inv(A)*b) doesn’t use much more memory and is 3x slower (I expected worse), but it is very accurate. However, it does not run on the GPU.

People in my field use GMRES for these matrices, but it seems IterativeSolvers.gmres does not run on the GPU either.

Does anyone know of a good way to solve these problems that works on both CPUs and GPUs? Speed is very important, but accuracy matters more. Differences of 0.1% could be OK; 10-300% (which is what I’m getting) is not.
Thanks a lot!

EDIT: TL;DR: on the GPU, A\b is only accurate with Float64, not Float32. As long as there isn’t a simple and robust linear solver that runs on multiple CPUs/GPUs, A\b seems to be the way to go (and I’ll stick to one CPU/GPU for now).

code that solves A*x=b, where A can be quite big (say 10000x10000, hopefully bigger in the future)

Given the size of your linear systems, you may want to consider iterative and/or matrix-free methods. Especially when targeting GPUs, you may run into memory limitations and scalability issues with direct (matrix-based) solvers.

Anyone know of a good way to solve these problems that works on CPUs and GPUs?

If speed is among your priorities, you could check out ParallelStencil.jl, as it provides an architecture-agnostic iterative framework for multi-threaded CPU, GPU and multi-XPU execution.

Check out the miniapps section for examples of real-world applications that are solved with iterative approaches.

Feel free to reach out @luraess or @samo if you are curious about further details.

Thanks, @luraess
I watched your JuliaCon 2019 talk the other day on those methods. Really cool stuff.
I was indeed thinking that iterative methods would be a good fit and, as I said, many people use GMRES for the problems I’m working on, but I couldn’t find a CUDA-compatible GMRES module out there.
Your examples are all PDE-based, so locality helps parallelization (apart from the halo regions). I didn’t see an example that directly translates to solving A*x=b. Do you have any? I’m trying to avoid major implementation work at this stage (if possible).
Thanks again!

Thanks @Ribeiro for your feedback regarding the JuliaCon 2019 talk - appreciated!

I couldn’t find a CUDA compatible GMRES module out there

Unfortunately, I don’t have much insight into CUDA-capable GMRES implementations.

I didn’t see an example that directly translates to solving Ax=b. Do you have any?

All systems of PDEs solved within the ParallelStencil.jl miniapps could be expressed as A*x=b, with A being the “coefficient” matrix containing the stencil information, b the right-hand side (usually forcing terms) and x the unknown field. One approach you could try for iteratively solving A*x=b is to rewrite the system in residual form, Rx = b - A*x. You can then introduce a pseudo-time step ∆τ, define ∆x/∆τ = Rx, and iterate (x = x + ∆τ*Rx) until you reach steady state, i.e. until ∆x/∆τ -> 0. At that point you will have iteratively found a solution to your original system. You can find more info on this approach in @samo’s JuliaCon 2020 presentation.
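For a tiny symmetric positive definite system, the update above reduces to a Richardson-style iteration. A minimal CPU sketch (the function name, test matrix and step size ∆τ are all my own illustrative choices, not from the presentation; the damping accelerators it describes are omitted):

```julia
using LinearAlgebra

# Minimal pseudo-transient (Richardson-type) sketch for A*x = b.
# For SPD A this converges when 0 < dtau < 2 / opnorm(A).
function pseudo_transient(A, b; dtau, maxiter=100_000, tol=1e-10)
    x = zero(b)
    for _ in 1:maxiter
        R = b .- A * x                       # residual Rx = b - A*x
        x .+= dtau .* R                      # x = x + Δτ * Rx
        norm(R) < tol * norm(b) && return x  # steady state: Rx -> 0
    end
    return x
end

A = [4.0 1.0; 1.0 3.0]   # small SPD test matrix (eigenvalues ≈ 2.38, 4.62)
b = [1.0, 2.0]
x = pseudo_transient(A, b; dtau=0.3)
```

The appeal for GPUs is that each iteration is only a matrix-vector product and an axpy, both of which map well onto GPU arrays; no factorization is ever formed.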

@luraess What you’re describing sounds like a Jacobi solver. Those are not super stable. Maybe it would be the easiest to implement, though. I’ll watch the video!

@ctkelley I played around a bit and seem to see the errors on the GPU when using Float32. Float64 seems OK in my tests so far. On the CPU I always use Float64 and that seems to work.

@rveltz Thanks. I just tested KrylovKit and it does run on the GPU, but it uses scalar indexing, which Julia warns is very slow, and I can confirm (I think it runs sequentially). IterativeSolvers.gmres did not work at all. See the code below.

I ran this on a very high end desktop:

```julia
N = 10000
using CUDA
A = rand(N,N); x = rand(N); b = A*x;
A_d = CUDA.rand(N,N); x_d = CUDA.rand(N); b_d = A_d*x_d;
@time x1 = A\b; isapprox(x1, x)            # 1.5 sec (runs automatically on many cores), very accurate
@time x1_d = A_d\b_d; isapprox(x1_d, x_d)  # 0.3 sec, large errors; or 2.7 sec and very accurate with Float64
using KrylovKit
@time x2 = linsolve(A, b); isapprox(x2[1], x)            # 35 sec, large errors
@time x2_d = linsolve(A_d, b_d); isapprox(x2_d[1], x_d)  # 267 sec, large errors
using IterativeSolvers
@time x3 = gmres(A, b); isapprox(x3, x)        # 126 sec, large errors
@time x3_d = gmres(A_d, b_d); isapprox(x3_d, x_d)  # didn't work at all
```

I’ll have to do some more testing, but maybe A\b with Float64 on the GPU is the way to go for now.
Any more tips would be appreciated. As @luraess said, a good iterative solver would likely be preferable in the long term (especially since running on multiple GPUs is one of the things I’d like to test, and I don’t think I’ll be able to do A\b on multiple GPUs =]).
Thanks!

```
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/WV76E/src/host/indexing.jl:43
```
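If the Float32 errors really are a precision issue rather than a GPU bug, they should be reproducible on the CPU alone by simply converting the system. A quick sketch (N reduced to 1000 so it runs fast; the error thresholds are illustrative assumptions):

```julia
using LinearAlgebra

# Relative error of A \ b in both precisions for a random dense system.
# rand(N,N) loses several digits to its condition number, so Float32's
# ~7 significant digits disappear quickly while Float64 keeps headroom.
N = 1000
A = rand(N, N); x = rand(N); b = A * x

err32 = norm(Float32.(A) \ Float32.(b) .- x) / norm(x)
err64 = norm(A \ b .- x) / norm(x)
```

In runs like this, err32 typically comes out several orders of magnitude larger than err64, which would match the GPU observations above without any GPU involvement.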

Here’s a summary of time/accuracy. I ran these many times to make sure compilation was not an issue (pretty sure I did that last time too). N is 10x smaller so that these tests didn’t take too long. All Float64.

Yeah, I’m getting decent times with that too. The main problem was the errors, but further testing indicates those were due to Float32 truncation on the GPU. It would be nice to have an option other than \ that runs on multiple CPU/GPU nodes, but maybe that’s too much for now.
Hard to say how fast is fast enough. Long term I’ll be running A\b thousands of times.
Yep, A is always square.
Thanks!

What you’re describing sounds like a Jacobi solver. Those are not super stable. Maybe it would be the easiest to implement, though. I’ll watch the video!

Sure @Ribeiro, have a look at @samo’s presentation and let me know if anything is unclear. The approach, somewhat similar to Jacobi, is called pseudo-transient (or dynamic relaxation) iteration. You can then significantly accelerate convergence by adding a damping-like procedure (referred to as convergence accelerators in the presentation). Worth it!
Cheers!

Two problems with GPUs (not that I’m an expert, though…):

limited RAM, shared among many cores

GPUs default to Float32, which may give inaccurate results

“Cheap” GPUs (like those in gaming PCs, costing up to some US$500+) have very limited Float64 throughput, which is very slow compared to Float32 (something like 32 times slower?). Compute-oriented GPUs like the (old?) Nvidia Tesla line (which used to cost some US$3000 and up) handle Float64 much more efficiently, but it is still some 4 times slower than Float32. I think…

Change A .= I .+ A ./N and everything should work; debugging will be easier too. Now, if your matrix really is rand(N,N), I’d rather do the math than the computations…
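The point of the I .+ A ./N suggestion is conditioning: shifting the spectrum away from zero makes the system far better behaved for any solver, iterative or direct. A quick check (N kept small so cond is cheap; I use the non-broadcast I + A ./ N form here, which is guaranteed to work with UniformScaling):

```julia
using LinearAlgebra

# rand(N,N) has one dominant singular value (~N/2) and a tiny smallest
# one, so its condition number grows with N; I + A/N keeps all singular
# values near 1 and the system well conditioned.
N = 500
A = rand(N, N)
B = I + A ./ N   # shifted, well-conditioned variant

c_A = cond(A)    # typically in the thousands or more for this N
c_B = cond(B)    # typically a small single-digit number
```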

It seems very peculiar that inv(A)*b is more accurate than A\b. Error analysis suggests that the opposite should hold. But then again, sometimes the stars align just right and all the rounding errors go in the same direction. Does this also hold for a random b (with the same A)? Or does it only work with that specific choice of right-hand side?
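One way to run that check is to fix A, form several random right-hand sides from known solutions, and compare both approaches. A small CPU sketch (N and the error threshold are my own illustrative choices):

```julia
using LinearAlgebra

# Compare A \ b against inv(A) * b for several random RHS vectors built
# from known solutions x; error analysis predicts backslash should be
# at least as accurate on average.
N = 500
A = rand(N, N)
Ainv = inv(A)

errs = map(1:5) do _
    x = rand(N); b = A * x
    (bs = norm(A \ b .- x) / norm(x),      # backslash error
     iv = norm(Ainv * b .- x) / norm(x))   # explicit-inverse error
end
```

If the explicit inverse only wins for one particular b, that would point to the lucky-rounding scenario rather than a genuine accuracy advantage.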