That seems to have been a mistake on my side. A\b is only inaccurate on the GPU for Float32 (it may be inaccurate on the CPU for Float32 too; I haven’t tested that).

Sorry, I don’t understand. That command doesn’t work. The random matrices I’m using in these examples are just placeholders to demonstrate my issue. My real matrix is quite different.

cond(rand(1000,1000)) > 20000
Again, I’m using these random matrices as tests that I can easily share. Not sure what that number means.

I looked closely and linsolve is complaining that it didn’t converge, but if it didn’t converge after all that time, then it’s not nearly as good as A\b.

That tells me that you’re likely to lose 4–5 figures in the computation, so it would not be a surprise to see norm(A*x - b, Inf) as large as 10^{-12}. If you use the L^2 norm, then you could also scale by the square root of the dimension and get up to 10^{-10}, which is what @BLI got in a post upstream.
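To make the digit counting concrete, here is a small sketch of that estimate (the random matrix is a stand-in like the ones above):

```julia
using LinearAlgebra

# Rule of thumb: you lose about log10(cond(A)) digits, so the achievable
# residual is roughly cond(A) * eps(eltype) for a right-hand side of order 1.
A = rand(1000, 1000)
κ = cond(A)
println("cond(A)                ≈ ", κ)                  # typically > 20000
println("Float64 residual level ≈ ", κ * eps(Float64))   # on the order of 10^{-12}
println("Float32 residual level ≈ ", κ * eps(Float32))   # on the order of 10^{-3}
```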

This is on the CPU. So GMRES is really fast but (using default parameters) not nearly as accurate as the others (probably good enough, though), and linsolve is 10x faster than \ with the same accuracy (for this case), which makes sense!
So I tested all of the options on my real matrix (a small one, though) and got this:

All of them were equally accurate (1.5e-3, which is ok). So on a CPU, for my specific problem it looks like GMRES works really well.
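For anyone who wants to reproduce this kind of comparison, here is roughly the shape of what I ran on the CPU (a sketch only; the random matrix is again a placeholder for my real one, and as noted above linsolve may not converge on a matrix like this):

```julia
using LinearAlgebra, KrylovKit

N = 1000
A = rand(N, N)
b = rand(N)

@time x1 = A \ b                          # direct solve
@time x2, info = linsolve(z -> A * z, b)  # KrylovKit's GMRES-based solver

println("\\ residual:        ", norm(A * x1 - b, Inf))
println("linsolve residual: ", norm(A * x2 - b, Inf),
        "  (converged: ", info.converged, ")")
```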
So now I have to nag you about the GPU again. linsolve is clearly not running well on the GPU (I still get the warnings) and gmres doesn’t run at all. You said you use them on GPUs. What’s your trick?
(Really appreciate the help!)

Sorry, but that is going way over my head. I looked at my real (smallish) case and on my 1914x1914 matrix I get cond=2.06. What does that tell you? Thanks a lot!

A=CUDA.rand(3,3).+1;
x=CUDA.rand(3);
b=A*x;
x1=linsolve(z->A*z, b)
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/WV76E/src/host/indexing.jl:43

You’re able to run linsolve if you set CUDA.allowscalar(false)?

That condition number is completely benign and a dimension of 2000 is not going to do much damage. Is this the problem where A\b failed to give a decent residual on the GPU in single precision? If so, that is a real mystery. There’s theory that says that should not happen. Does the same thing happen in Float32 on your CPU?
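If you want to check, something like this on the CPU would do it (a sketch; substitute your real matrix for the random one):

```julia
using LinearAlgebra

N = 2000
A = rand(N, N)
b = A * rand(N)

x64 = A \ b
x32 = Float32.(A) \ Float32.(b)   # same system, solved in single precision

println("Float64 residual: ", norm(A * x64 - b, Inf))
println("Float32 residual: ", norm(A * x32 - b, Inf))
# The Float32 residual should be larger by roughly eps(Float32)/eps(Float64),
# but still modest; a much larger gap than that would be a real anomaly.
```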

I just tested it and this matrix (cond=2) is fine on the GPU using Float32. So the accuracy problem was strongly linked to the poorly conditioned random matrices I was using for testing. Thanks!

Huh. Odd. There is a high probability that I’m doing something stupid =].
If you at some point get to test on a GPU and see that this:

using KrylovKit
using CUDA
CUDA.allowscalar(false);
A=CUDA.rand(3,3).+1;
x=CUDA.rand(3);
b=A*x;
x1=linsolve(z->A*z, b)

works, then please let me know. Maybe you load some other stuff that is needed or have some CUDA settings that I don’t (or linsolve settings!).
Thanks a lot!

You need to know that KK works for very general vector types, so you may have to define the appropriate methods for your type for it to work. I am surprised that it does not work here, though. Maybe @juthohaegeman has an idea…

The reason I am surprised is that it used to be required to define the right mul!/axpy! methods, but that is not needed anymore, as you can see here. In this tutorial, you are shown how to solve linear equations on the GPU (among other things).

Ok, the culprit is the mul! call at src/linsolve/gmres.jl:11. You can fix it (with type piracy?) with:

using KrylovKit, LinearAlgebra
using CUDA
CUDA.allowscalar(false);
A=CUDA.rand(3,3).+1;
x=CUDA.rand(3);
b=A*x;
LinearAlgebra.mul!(x::CuArray{T,1},y::CuArray{T,1},α) where T = x .= α .* y
x1=linsolve(z->A*z, b, b)
(Float32[0.72570944, 0.23375773, 0.08379406], ConvergenceInfo: one converged value after 3 iterations and 11 applications of the linear map;
norms of residuals are given by (0.0f0,).
)

@rveltz Thanks for all the tips. A lot of it goes over my head, but it helps!
Your hacks work and linsolve now runs on the GPU! It is slightly slower and less reliable than \ though.
On my real problem, \ took 0.07 sec, while linsolve took 0.1 sec, both with good accuracy.
On the big random matrix, linsolve diverged, while \ worked great.
So it seems that \ still beats the other options on the GPU, which is surprising to me, but fine for now.
Thanks again!

Again, I disagree! Iterative methods should work best for A = I + rand(N,N) ./ N with N large enough.
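To see why, compare the conditioning of the two kinds of test matrices (a quick sketch):

```julia
using LinearAlgebra

N = 1000
A_good = I + rand(N, N) ./ N   # perturbation of the identity
A_bad  = rand(N, N)            # the kind of test matrix used above

println("cond(I + rand/N): ", cond(A_good))  # small, order 1
println("cond(rand):       ", cond(A_bad))   # typically > 10^4
# GMRES converges in a few iterations for A_good because its spectrum
# clusters near 1; for A_bad it can stagnate or diverge.
```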

The thing to remember is that you cannot use gmres like \. It is not a black box: you need to know some properties of A (perturbation of the identity, a preconditioner) for it to work well.
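For instance (a sketch, not something from this thread): since linsolve just takes a function, you can fold a left preconditioner M into the operator yourself. The diagonal preconditioner here is only an example choice:

```julia
using LinearAlgebra, KrylovKit

N = 1000
A = Diagonal(1.0:N) + rand(N, N) ./ N   # badly scaled rows, hard for plain GMRES
b = rand(N)

# Solve the left-preconditioned system M⁻¹ A x = M⁻¹ b with M = diagonal of A;
# M⁻¹ A is close to the identity, so GMRES converges quickly.
M = Diagonal(A)
x, info = linsolve(z -> M \ (A * z), M \ b)

println("converged: ", info.converged)
println("residual:  ", norm(A * x - b, Inf))
```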

@rveltz I believe you. I’m not versed in GMRES yet, so I am indeed using it as a black box. I might have to revisit this in the near future, but will need to educate myself a bit before conducting more tests.
Thanks!