In my experiment,I want to solve the equation Ax=b on GPU to accelerate the process.I have tried the iterative solver from Kryvol.jl and cuSOLVER,but the GPU is still slower than CPU.Is that normal?

```
using CUDA
using SparseArrays
using LinearAlgebra
using Krylov
using BenchmarkTools
using JLD2
# Function： Using BiCGStab on CPU to solve the sparse matrix linear equation
function solve_sparse_cpu(A, b)
x = zeros(Float32, size(b))
x=Krylov.bicgstab(A, b)
return x
end
# Function： Using BiCGStab on GPU to solve the sparse matrix linear equation
function solve_sparse_gpu(d_A, d_b)
d_x = CUDA.zeros(Float32, size(d_b))
d_x=Krylov.bicgstab(d_A, d_b, d_x)
return d_x
end
@load "C:/Users/13733/Desktop/DistributionPowerFlow/J.jld2" J
@load "C:/Users/13733/Desktop/DistributionPowerFlow/F.jld2" F
matrix_sizes = size(J,1)
for n in matrix_sizes
println("The scale of matrix A: $n x $n")
A = Float32.(sparse(J))
b = Float32.(-F)
d_A = CUDA.CUSPARSE.CuSparseMatrixCSC(A)
d_b = CuArray(b)
# CPU time consumption
cpu_time = @belapsed solve_sparse_cpu($A, $b)
println("CPU time consumption $cpu_time seconds")
# GPU time consumption
gpu_time = @belapsed solve_sparse_gpu($d_A, $d_b)
println("GPU time consumption: $gpu_time seconds")
println("speed up (CPU / GPU): $(cpu_time / gpu_time)")
println("=====================================")
end
```

The number of non-zero element in matrix J is 23958 and the size of J is 3498*3498.The output result is as follows.

```
The scale of matrix A: 3498 x 3498
CPU time consumption 0.1821423 seconds
GPU time consumption: 1.2135631 seconds
speed up (CPU / GPU): 0.15008885817309375
=====================================
```

I found that in highly sparse matrices, solving on GPU is not as fast as on CPU

1 Like

A is a 3498*3498 sparse matrix

My GPU is NVIDIA GeForce RTX 4060 laptop,and my CPU is i9-12900HX

Hi @automaticvehiclerook!

For this kind of small linear systems, the GPU could be less efficient than CPU.

You could try a few things to improve performances:

- Use the
`CuSparseMatrixCSR`

format, it’s faster if you only need products with `A`

and not its adjoint `A'`

.
- Preallocate the workspace with
`solver = BicgstabSolver(d_A, d_b)`

and call the method in-place with `bicgstab!(solver, d_A, d_b)`

to check if the majority of time spent is not related to allocations.
- Use the package
`KrylovPreconditioners.jl`

to create an operator optimized for multiple products `A * v`

. You just need to pass `op_A = KrylovOperator(d_A)`

to `bicgstab`

or `bicgstab!`

.

I also suggest to use `CUDA.@profile`

to check what is the most expensive operation on GPU for your linear system.

Note that you can also have a faster code on CPU if you use multithreaded matrix-vector products and optimized BLAS.

Just adding `using MKL, MKLSparse`

before calling a Krylov solver can lead to a nice speed-up.

The question is about GPU, but it’s better to have a good baseline on CPU for comparison.

2 Likes

I suggest to also try CUDSS.jl.

It’s an interface to sparse direct methods (LDL’, LL’ or LU) on NVIDIA GPUs.

It’s still a preview library of NVIDIA but it works well.

It should be more efficient than Krylov.jl for small problems.

4 Likes

Thank you for your reply！In fact, what I want to solve is a large-scale linear equation.I tested a 17036*17036 system and it took about 20 seconds on the CPU and 10 seconds on the GPU.I want to use a preconditioner for Bicgstab but I don’t know how to do it on GPU.

I have to adjust the precision to Float64 to meet my engineering requirement

```
using CUDA
using SparseArrays
using LinearAlgebra
using Krylov
using BenchmarkTools
using JLD2
# Function： Using BiCGStab on CPU to solve the sparse matrix linear equation
function solve_sparse_cpu(A, b)
x = zeros(Float64, size(b))
x=Krylov.bicgstab(A, b)
return x
end
# Function： Using BiCGStab on GPU to solve the sparse matrix linear equation
function solve_sparse_gpu(d_A, d_b)
d_x = CUDA.zeros(Float64, size(d_b))
d_x=Krylov.bicgstab(d_A, d_b, d_x)
return d_x
end
@load "C:/Users/13733/Desktop/DistributionPowerFlow/J9241" J9241
@load "C:/Users/13733/Desktop/DistributionPowerFlow/F9241" F9241
J=J9241
F=F9241
matrix_sizes = size(J,1)
for n in matrix_sizes
println("The scale of matrix A: $n x $n")
A = Float64.(sparse(J))
b = Float64.(-F)
d_A = CUDA.CUSPARSE.CuSparseMatrixCSC(A)
d_b = CuArray(b)
# CPU time consumption
cpu_time = @belapsed solve_sparse_cpu($A, $b)
println("CPU time consumption $cpu_time seconds")
# GPU time consumption
gpu_time = @belapsed solve_sparse_gpu($d_A, $d_b)
println("GPU time consumption: $gpu_time seconds")
println("speed up (CPU / GPU): $(cpu_time / gpu_time)")
println("=====================================")
end
```

I give a few examples in the documentation of Krylov.jl:

On GPU, I suggest the following preconditioners:

- Jacobi / block-Jacobi
- Incomplete Cholesky with zero fill-in – IC(0)
- Incomplete LU with zero fill-in – ILU(0)

All of them are available in KrylovPreconditioners.jl.

You can find some examples here of how to use `kp_ic0`

, `kp_il0`

and `BlockJacobiPreconditioner`

on GPUs.

3 Likes

Thank you very much for your help. I will try it