Cuda cg + ilu?

Hi all,
I’m looking to use a GPU CG method with an ILU preconditioner. I’m currently using a CPU CG method through IterativeSolvers and ILU through IncompleteLU.jl . The CG method from IterativeSolvers is easily extended to GPUs however, some work is required on IncompleteLU.jl to implement it on a GPU. I think only forward_substitution and backward_substitution methods are required to extend this. I’ve tried using sv2! and ilu02 with little success. Does anyone have any ideas?

do you need the ILU on GPU? You could just pass the decomposition to the GPU, see Complex Ginzburg-Landau 2d · Bifurcation Analysis in Julia

The issue is that the preconditioner requires the ‘ldiv!’ method as per Unless I am mistaken?

The link you sent does look helpful though. Thank you for that

Here is some code:

using LinearAlgebra
using SparseArrays
using IncompleteLU
using IterativeSolvers

val = sprand(200,200,0.05);
A_cpu = val*val'
b_cpu = rand(200)

A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)

Precilu = ilu(A_cpu, τ = 3.0)

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)                       
  sv2!('N', 'L', 1.0, P, y, 'O') 
  sv2!('N', 'U', 1.0, P, y, 'O')  
  return y

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)                        
  sv2!('N', 'L', 1.0, P, x, 'O') 
  sv2!('N', 'U', 1.0, P, x, 'O')  
  return x

# function LinearAlgebra.ldiv!(_lu::LUperso, rhs::CUDA.CuArray)
# 	_x = UpperTriangular(_lu.Ut) \ (LowerTriangular(_lu.L) \ rhs)
# 	rhs .= vec(_x)
# 	# CUDA.unsafe_free!(_x)
# 	rhs
# end
# function LinearAlgebra.ldiv!(yrhs::CuArray,_lu::LUperso, rhs::CuArray)
# 	copyto!(yrhs,rhs)
# 	_x = UpperTriangular(_lu.Ut) \ (LowerTriangular(_lu.L) \ rhs)
# 	rhs .= vec(_x)
# 	# CUDA.unsafe_free!(_x)
# 	rhs
# end

struct LUgpu
	Ut	# transpose of U in LU decomposition

P = LUperso(LowerTriangular(CuSparseMatrixCSR(I+Precilu.L)), UpperTriangular(CuSparseMatrixCSR(sparse(Precilu.U'))));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=10^-7,maxiter=1000)

P_cpu = ilu(A_cpu,τ=3.0)
val = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu,tol=10^-7,maxiter=1000)