Cannot solve A*x = B on the GPU with A a CuSparseMatrixCSC

Hello,

I am trying to solve the matrix equation A*x = B, where A is the discretization of a differential operator.
A is a CuSparseMatrixCSC{Float64}, and B and x are CuArray{Float64,1}.
I write the following line

x .= A \ B

and get the following warning:

Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with allowscalar(false)

and the following error:

MethodError: no method matching lu!(::CuSparseMatrixCSC{Float64}, ::Val{true}; check=true)
Closest candidates are:
lu!(!Matched::StridedArray{T, 2}, ::Union{Val{true}, Val{false}}; check) where T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64} at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\LinearAlgebra\src\lu.jl:79

Stacktrace:
[1] lu(::CuSparseMatrixCSC{Float64}, ::Val{true}; check::Bool) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\LinearAlgebra\src\lu.jl:273
[2] lu(::CuSparseMatrixCSC{Float64}, ::Val{true}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\LinearAlgebra\src\lu.jl:272 (repeats 2 times)
[3] \(::CuSparseMatrixCSC{Float64}, ::CuArray{Float64,1}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\LinearAlgebra\src\generic.jl:1116

For this test A is a small matrix, but once it works A will be a 5324800×5324800 CuSparseMatrixCSC{Float64} with 47806720 stored entries. Also, I have to do this operation at every time step (500’000 steps) because B = B(t).

Thank you in advance for your help,

Ludo

Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

At first glance, I would say the .= is wrong. The dot operator is used for broadcasting, which is not necessary here just to save the resulting vector in x.
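
Roughly, a minimal CPU-only illustration of the difference between the two spellings (toy arrays, just for this example):

using LinearAlgebra

A = [2.0 0.0; 0.0 4.0]
B = [1.0, 2.0]
x = zeros(2)

x .= A \ B   # computes A \ B into a temporary, then copies the result element-wise into the existing x
x = A \ B    # simply rebinds the name x to the newly allocated solution vector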

Unfortunately, I don’t know much about CUDA, so I can’t help you there.

Try using a CSR matrix and csrlsvqr!. Here’s an example:

using SparseArrays
using LinearAlgebra
using CUDA

n = 100
a = sprand(n, n, 0.2) + sparse(I, n, n)    # random sparse test matrix; adding I helps keep it nonsingular
A = CUSPARSE.CuSparseMatrixCSR(a)          # move it to the GPU in CSR format
b = CUDA.rand(Float64, n)
x = CuVector{Float64}(undef, n)            # output vector on the GPU

tol = 1e-8
CUSOLVER.csrlsvqr!(A, b, x, tol, one(Cint), 'O')   # sparse QR solve on the GPU

See https://github.com/JuliaGPU/CUDA.jl/blob/master/test/cusolver/sparse.jl


Thank you,

I tried to apply your advice and got

LinearAlgebra.SingularException(29)

I just realized that my matrix A is not invertible :cry:

After some changes to A it works well.
However, it is very slow compared to a naive iterative scheme.

As I have to call CUSOLVER.csrlsvqr!(A, b, x, tol, one(Cint), 'O') many times (500’000), I would like to know if I can first compute the QR (or LU) decomposition and then use the solver.
Or maybe there is a way to compute the decomposition of A only once and keep it in memory?

I’m not sure how to do a sparse factorization on the GPU directly; maybe someone else can help.

If you’d like to use iterative solvers, there are a few packages that implement Krylov-type methods. Also, maybe you could do the factorization on the CPU and copy the arrays to the GPU; I haven’t tried this, so it might not work.
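
For instance, here is a minimal sketch with Krylov.jl (one of those packages); I’m assuming its bicgstab solver and default keyword names, and I haven’t tried it on your actual operator:

using SparseArrays, LinearAlgebra
using CUDA
using Krylov   # Krylov.jl: Krylov methods that accept GPU arrays

n = 100
a = sprand(n, n, 0.2) + sparse(I, n, n)    # CPU sparse test matrix
A = CUDA.CUSPARSE.CuSparseMatrixCSR(a)     # move it to the GPU once
b = CUDA.rand(Float64, n)

# BiCGStab only needs matrix-vector products A*v, so nothing is factorized;
# the same A can be reused at every time step with a new right-hand side b.
x, stats = bicgstab(A, b; atol = 1e-8, rtol = 1e-8)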

This is what I do. You have an example here:
https://rveltz.github.io/BifurcationKit.jl/dev/tutorialsCGL/#Continuation-of-periodic-orbits-on-the-GPU-(Advanced)-1
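
The tutorial has the full details; a much simplified sketch of the “factorize once on the CPU, reuse it every step” idea could look like this (illustrative sizes and names, not the tutorial’s actual code):

using SparseArrays, LinearAlgebra
using CUDA

n = 1000
a = sprand(n, n, 5/n) + sparse(I, n, n)    # CPU sparse stand-in for the operator
F = lu(a)                                  # sparse LU (UMFPACK), computed once

x_gpu = CUDA.zeros(Float64, n)

for step in 1:10                           # in the real problem: 500’000 time steps
    b_gpu = CUDA.rand(Float64, n)          # stand-in for B = B(t), assembled on the GPU
    b_cpu = Array(b_gpu)                   # copy the right-hand side to the host
    x_cpu = F \ b_cpu                      # reuse the factorization: only triangular solves
    copyto!(x_gpu, x_cpu)                  # copy the solution back to the GPU
end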


Thank you,
So I tried this method, and it is working well!

However, I think my matrix is too large to be solved by a direct method. There are 80 times more elements in my L matrix than in A, and approximately the same ratio in U. (I’m using an incomplete LU decomposition with a 1e-7 tolerance; with that I get the same precision as a naive Jacobi iterative scheme. As my system is chaotic, I was looking for better precision.)
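
For reference, a minimal sketch of the kind of fill-in I’m seeing, assuming IncompleteLU.jl’s ilu and its L/U fields (my actual setup differs in the details):

using SparseArrays, LinearAlgebra
using IncompleteLU   # provides ilu(A; τ) for a SparseMatrixCSC

n = 1000
a = sprand(n, n, 5/n) + sparse(I, n, n)    # CPU sparse stand-in for A

fact = ilu(a, τ = 1e-7)                    # incomplete LU with drop tolerance 1e-7
@show nnz(a) nnz(fact.L) nnz(fact.U)       # compare stored entries to see the fill-in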

Thank you, I will have a look.