Cannot solve Ax=B on GPU with A CuSparseMatrixCSC


I am trying to solve an A*x=B matrix equation, with A being the discretization of a differential operator.
A is a CuSparseMatrixCSC{Float64}, B and x are CuArray{Float64,1}.
I write the following line

x .= A \ B

and get the following warning:

Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with allowscalar(false)

and the following error:

MethodError: no method matching lu!(::CuSparseMatrixCSC{Float64}, ::Val{true}; check=true)
Closest candidates are:
lu!(!Matched::StridedArray{T, 2}, ::Union{Val{true}, Val{false}}; check) where T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64} at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\LinearAlgebra\src\lu.jl:79

[1] lu(::CuSparseMatrixCSC{Float64}, ::Val{true}; check::Bool) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\LinearAlgebra\src\lu.jl:273
[2] lu(::CuSparseMatrixCSC{Float64}, ::Val{true}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\LinearAlgebra\src\lu.jl:272 (repeats 2 times)
[3] (::CuSparseMatrixCSC{Float64}, ::CuArray{Float64,1}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\LinearAlgebra\src\generic.jl:1116

For this test A is a small matrix, but once it works A will be a 5324800×5324800 CuSparseMatrixCSC{Float64} with 47806720 stored entries. Also, I have to do this operation at each of 500'000 time steps, because B = B(t).

Thank you in advance for your help,


Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

At first glance, I would say the .= is wrong. The dot operator is used for broadcasting, which is not needed just to store the resulting vector in x.
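For context (my own CPU-only illustration, not from the thread): both spellings produce the same values; `.=` just copies the result into an existing array instead of rebinding the name, so it is not the cause of the error:

```julia
using LinearAlgebra

A = [2.0 0.0; 0.0 4.0]
b = [2.0, 8.0]

x = zeros(2)
x .= A \ b   # broadcasting assignment: A \ b allocates a temporary, then its values are copied into x
println(x)   # [1.0, 2.0]

y = A \ b    # plain assignment: y is simply bound to the freshly allocated result
println(y)   # [1.0, 2.0]
```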

Unfortunately, I don’t know much about Cuda. So I can’t help you there.

Try using a CSR matrix and csrlsvqr!. Here's an example:

using SparseArrays
using LinearAlgebra
using CUDA 

n = 100
a = sprand(n,n,0.2) + sparse(I, n,n)
A = CUSPARSE.CuSparseMatrixCSR(a)
b = CUDA.rand(Float64, n)
x = CuVector{Float64}(undef,n)

tol = 1e-8
CUSOLVER.csrlsvqr!(A, b, x, tol, one(Cint),'O')
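One way to sanity-check the result (my own addition, not part of the original reply) is to copy the solution back to the host and look at the residual:

```julia
# Assumes the variables a, A, b, x from the snippet above are in scope.
x_cpu = Array(x)                 # device -> host copy of the solution
b_cpu = Array(b)                 # device -> host copy of the right-hand side
println(norm(a * x_cpu - b_cpu)) # residual; should be on the order of tol or smaller
```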



Thank you,

I tried to apply your advice, and I got an error.


I just realized that my matrix A is not invertible :’(

After some changes to A it works well.
However, it is very slow compared to a naive iterative scheme.

As I have to call CUSOLVER.csrlsvqr!(A, b, x, tol, one(Cint),'O') many times (500'000), I would like to know if I can first compute the QR (or LU) decomposition and then reuse it in the solver.
Or maybe there is a way to compute the decomposition of A only once and keep it in memory?

I’m not sure how to do sparse factorization on the GPU directly, maybe someone else could help?

If you’d like to use iterative solvers, there are a few packages that implement Krylov-type methods. Alternatively, you could do the factorization on the CPU and copy the arrays to the GPU; I haven’t tried this, so it might not work.
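The CPU-factorization idea could look roughly like this (a sketch under my own assumptions, not code from the thread): factor A once on the host with the sparse LU from SuiteSparse, then do only the cheap triangular solves and the host/device copies inside the time loop.

```julia
using SparseArrays, LinearAlgebra, CUDA

n = 1000
a = sprand(n, n, 0.01) + 10I   # CPU sparse matrix (assumed invertible here)
F = lu(a)                      # sparse LU via UMFPACK, computed once

x_gpu = CUDA.zeros(Float64, n)
for step in 1:5                # stands in for the 500'000 time steps
    b_gpu = CUDA.rand(Float64, n)  # B = B(t), produced on the GPU
    b = Array(b_gpu)               # copy right-hand side to the host
    x = F \ b                      # triangular solves only, no refactorization
    copyto!(x_gpu, x)              # copy the solution back to the device
end
```

Whether this beats csrlsvqr! depends on how expensive the host/device copies are relative to the solve.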

This is what I do. You have an example here:


Thank you,
So I tried this method, and it is working well!

However, I think my matrix is too large to be solved by a direct method: there are about 80 times more elements in my L factor than in A, and roughly the same ratio for U. (I'm using an incomplete LU decomposition with a 1e-7 tolerance; with that I get the same precision as a naive Jacobi iterative scheme. As my system is chaotic, I was looking for better precision.)
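An incomplete LU is usually more effective as a preconditioner for a Krylov method than as a near-exact factorization. A CPU sketch (the packages IncompleteLU.jl and IterativeSolvers.jl and their keyword names are my assumptions, not from the thread; older IterativeSolvers versions spell the tolerance `tol` instead of `reltol`):

```julia
using SparseArrays, LinearAlgebra
using IncompleteLU       # provides ilu with a drop tolerance τ
using IterativeSolvers   # provides gmres

n = 1000
A = sprand(n, n, 0.01) + 10I
b = rand(n)

P = ilu(A, τ = 1e-7)                    # drop-tolerance ILU, same 1e-7 as above
x = gmres(A, b; Pl = P, reltol = 1e-12) # left-preconditioned GMRES
println(norm(A * x - b))                # residual check
```

With a loose τ the factors stay much closer in size to A, and the Krylov iterations recover the accuracy.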

Thank you, I will have a look.