 # How to use the package ParallelStencil.jl to solve linear equation A*x=b in GPU?

For example, I can use the following code to solve linear equation in GPU, but this package Krylov.jl cannot support the complex matrix input, so when solving the complex linear equation, it has a poor efficiency. I found this package ParallelStencil.jl had a high efficiency in GPU calculation. Therefore, can I use this package to solve complex linear equation in GPU?

# Example with a general square system
using CUDA, Krylov, LinearOperators, BenchmarkTools
using CUDA.CUSPARSE, SparseArrays, LinearAlgebra, IncompleteLU

# CPU Arrays
n=2000
A_cpu = sprand(n,n,0.3);
A_cpu = A_cpu'*A_cpu
b_cpu = rand(n)

# GPU Arrays
@time A_gpu = CuSparseMatrixCSC(A_cpu)
@time b_gpu = CuVector(b_cpu)
println("A_gpu=",typeof(A_gpu))
println("b_gpu=",typeof(b_gpu))

# LU ≈ A for CuSparseMatrixCSC matrices
P = ilu02(A_gpu, 'O')

# Solve Py = x
function ldiv!(y, P, x)
copyto!(y, x)                        # Variant for CuSparseMatrixCSR
sv2!('N', 'L', 'N', 1.0, P, y, 'O')  # sv2!('N', 'L', 'U', 1.0, P, y, 'O')
sv2!('N', 'U', 'U', 1.0, P, y, 'O')  # sv2!('N', 'U', 'N', 1.0, P, y, 'O')
return y
end

# Operator that model P⁻¹
n = length(b_gpu)
T = eltype(b_gpu)
println(T)
opM = LinearOperator(T, n, n, false, false, (y, x, α, β) -> ldiv!(y, P, x))
println(typeof(opM))

# Solve an unsymmetric system with an incomplete LU preconditioner on GPU
@time (x, stats) = bicgstab(A_gpu, b_gpu, M=opM)



Welcome You can’t use the CUSOLVER.csrlsvlu! like in the CUDA.jl test here ?

This cannot calculate huge dimensional matrix. And if do so, it will cost huge time compared with cpu direct solver. I test the 10000*10000 random matrix, this GPU solver costs 496s, and the A\b solver costs only 29s.

Hi @liaoweiyang2017 and welcome! ParallelStencil.jl permits the leverage the parallel processing capabilities of GPUs adding support for stencil-based operations in a math-close fashion (ParallelStencil does not expose direct sparse solvers based on (partial) matrix factorisation).

ParallelStencil support Complex numbers so you could implement your linear solve A*x=b using an iterative approach to take advantage of Complex number support and GPU processing capabilities.
One simple and efficient way would be to implement a dynamic relaxation method, rewriting your system as

\frac{\partial A}{\partial \tau}=-Ax+b

and iterating until convergence or steady-state. You can, in most cases, severely accelerate this solution approach by implementing the second order Richardson method as described here and in the presentations listed as links in the ParallelStencil README (e.g. this ref).

Thank you @luraess. I have read the reference you have posted. But I still don’t know how to use your package to solve the linear equation. Can you give a concrete example to solve the linear equation A * x = b ? Such as the following matrix dimension:

using SparseArrays, LinearAlgebra
n=2000
A = sprand(n,n,0.3);
A = A'*A
b = rand(n)


Thanks for getting back to me. Find below a concrete example on how to solve a linear system of equations using ParallelStencil on a Nvidia GPU. Note that the approach works for physics-based PDEs as solving the 2D Laplacian (e.g. steady-state diffusion or smoothing operator) here (appreciate the low iteration count):

# Laplacian 2D
const USE_GPU = true
using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@static if USE_GPU
@init_parallel_stencil(CUDA, Float64, 2)
else
end
using Plots

@parallel function compute_1!(dAdt, A, D, dt, dmp, dx, dy)
return
end

return
end

@views function laplacian2D()
fact    = 40
# Physics
lx, ly  = 10, 10
D       = 1
# Numerics
nx, ny  = fact*50, fact*50
dx, dy  = lx/nx, ly/ny
niter   = 20*nx
dmp     = 2.0/nx
dt      = dx/sqrt(D)/2.1
# Initial conditions
A       = @zeros(nx, ny)
A[2:end-1,2:end-1] .= @rand(nx-2, ny-2)
# display(heatmap(Array(A)', aspect_ratio=1, xlims=(1,nx), ylims=(1,ny))); error("initial condition")
errv = []
# iteration loop
for it = 1:niter
@parallel compute_1!(dAdt, A, D, dt, dmp, dx, dy)
if it % nx == 0
err = maximum(abs.(A)); push!(errv, err)
p1=plot(nx:nx:it,log10.(errv), linewidth=3, markersize=4, markershape=:circle, framestyle=:box, legend=false, xlabel="iter", ylabel="log10(max(|A|))", title="iter=$it") p2=heatmap(Array(A)', aspect_ratio=1, xlims=(1,nx), ylims=(1,ny), title="max(|A|)=$(round(err,sigdigits=3))")
display(plot(p1,p2, dpi=150))
end
end
return
end

@time laplacian2D()


Note that equation could also be solved in the Complex space using ParallelStencil’s Complex number support. Also, in this MWE, A is your x- the solution vector, the @d_xi() and @d_yi operations are your A operator and b=0 here.

NOTE: I am afraid that this approach won’t be able to solve the sparse random matrix with random sparsity pattern you used in your MWE if this choice is motivated by something else than a MWE. One should try though. For which application / field are you requiring the solve of “unstructured” sparse matrices?

Thank you to answer my questions @luraess. My research field is to solve Maxwell equation. We use Finite Volume Method (FVE) or Finite Element Method (FEM) to solve the Maxwell equation. After some calculation, we will get the final form linear equation A * x = b . Because we want to get the result as soon as possible, so we want to use GPU iterative solvers to overcome the calculation bottleneck. Maybe your package cannot meet our needs. Do you know any other julia packages that can stably solve the complex huge sparse linear equation with GPU devices?

KrylovKit.jl, IterativeSolvers.jl…

1 Like

Thanks @liaoweiyang2017 for the further insights. Good to read you are solving Maxwell’s PDEs, which is not similar to your MWE since there, using either explicit or implicit FEM, FV or FD approaches, your coefficient matrix A will not be randomly sparse and have clear defined structure.

ParallelStencil does not expose solvers, but provides you with a framework to efficiently access GPUs; It would still be your task to write down an iterative solution approach to converge your set of equations. The MVE I shared with you could be a way of doing so, as it provides a residual minimisation strategy that scales well both in terms of memory throughput and iteration count – and has Complex support.

Alternatively you should maybe check out the other packages as suggested in this thread to see if they better suit your needs.

1 Like