How to use GPU acceleration to solve linear equation Ax=b

In my experiment,I want to solve the equation Ax=b on GPU to accelerate the process.I have tried the iterative solver from Kryvol.jl and cuSOLVER,but the GPU is still slower than CPU.Is that normal?

using CUDA
using SparseArrays
using LinearAlgebra
using Krylov
using BenchmarkTools
using JLD2
# Function: Using BiCGStab on CPU to solve the sparse matrix linear equation
function solve_sparse_cpu(A, b)
    x = zeros(Float32, size(b))
    x=Krylov.bicgstab(A, b)
    return x
end

# Function: Using BiCGStab on GPU to solve the sparse matrix linear equation
function solve_sparse_gpu(d_A, d_b)
    
    d_x = CUDA.zeros(Float32, size(d_b))  

    d_x=Krylov.bicgstab(d_A, d_b, d_x)

    return d_x  
end

@load "C:/Users/13733/Desktop/DistributionPowerFlow/J.jld2" J
@load "C:/Users/13733/Desktop/DistributionPowerFlow/F.jld2" F
matrix_sizes = size(J,1)

for n in matrix_sizes
    println("The scale of matrix A: $n x $n")

    A = Float32.(sparse(J))  
    b = Float32.(-F)
    d_A = CUDA.CUSPARSE.CuSparseMatrixCSC(A) 
    d_b = CuArray(b)            

    # CPU time consumption
    cpu_time = @belapsed solve_sparse_cpu($A, $b)
    println("CPU time consumption $cpu_time seconds")

    # GPU time consumption
    gpu_time = @belapsed solve_sparse_gpu($d_A, $d_b)
    println("GPU time consumption: $gpu_time seconds")

    println("speed up (CPU / GPU): $(cpu_time / gpu_time)")

    println("=====================================")
end

The number of non-zero element in matrix J is 23958 and the size of J is 3498*3498.The output result is as follows.

The scale of matrix A: 3498 x 3498
CPU time consumption 0.1821423 seconds
GPU time consumption: 1.2135631 seconds
speed up (CPU / GPU): 0.15008885817309375
=====================================

I found that in highly sparse matrices, solving on GPU is not as fast as on CPU

1 Like

How big is your system?

1 Like

A is a 3498*3498 sparse matrix

My GPU is NVIDIA GeForce RTX 4060 laptop,and my CPU is i9-12900HX

Hi @automaticvehiclerook!

For this kind of small linear systems, the GPU could be less efficient than CPU.
You could try a few things to improve performances:

  • Use the CuSparseMatrixCSR format, it’s faster if you only need products with A and not its adjoint A'.
  • Preallocate the workspace with solver = BicgstabSolver(d_A, d_b) and call the method in-place with bicgstab!(solver, d_A, d_b) to check if the majority of time spent is not related to allocations.
  • Use the package KrylovPreconditioners.jl to create an operator optimized for multiple products A * v. You just need to pass op_A = KrylovOperator(d_A) to bicgstab or bicgstab!.

I also suggest to use CUDA.@profile to check what is the most expensive operation on GPU for your linear system.

Note that you can also have a faster code on CPU if you use multithreaded matrix-vector products and optimized BLAS.
Just adding using MKL, MKLSparse before calling a Krylov solver can lead to a nice speed-up.
The question is about GPU, but it’s better to have a good baseline on CPU for comparison.

2 Likes

I suggest to also try CUDSS.jl.
It’s an interface to sparse direct methods (LDL’, LL’ or LU) on NVIDIA GPUs.
It’s still a preview library of NVIDIA but it works well.

It should be more efficient than Krylov.jl for small problems.

4 Likes

Thank you for your reply!In fact, what I want to solve is a large-scale linear equation.I tested a 17036*17036 system and it took about 20 seconds on the CPU and 10 seconds on the GPU.I want to use a preconditioner for Bicgstab but I don’t know how to do it on GPU.

I have to adjust the precision to Float64 to meet my engineering requirement

using CUDA
using SparseArrays
using LinearAlgebra
using Krylov
using BenchmarkTools
using JLD2
# Function: Using BiCGStab on CPU to solve the sparse matrix linear equation
function solve_sparse_cpu(A, b)
    x = zeros(Float64, size(b))
    x=Krylov.bicgstab(A, b)
    return x
end

# Function: Using BiCGStab on GPU to solve the sparse matrix linear equation
function solve_sparse_gpu(d_A, d_b)
    
    d_x = CUDA.zeros(Float64, size(d_b))  

    d_x=Krylov.bicgstab(d_A, d_b, d_x)

    return d_x  
end

@load "C:/Users/13733/Desktop/DistributionPowerFlow/J9241" J9241
@load "C:/Users/13733/Desktop/DistributionPowerFlow/F9241" F9241
J=J9241
F=F9241
matrix_sizes = size(J,1)

for n in matrix_sizes
    println("The scale of matrix A: $n x $n")

    A = Float64.(sparse(J))  
    b = Float64.(-F)
    d_A = CUDA.CUSPARSE.CuSparseMatrixCSC(A) 
    d_b = CuArray(b)            

    # CPU time consumption
    cpu_time = @belapsed solve_sparse_cpu($A, $b)
    println("CPU time consumption $cpu_time seconds")

    # GPU time consumption
    gpu_time = @belapsed solve_sparse_gpu($d_A, $d_b)
    println("GPU time consumption: $gpu_time seconds")

    println("speed up (CPU / GPU): $(cpu_time / gpu_time)")

    println("=====================================")
end

I give a few examples in the documentation of Krylov.jl:

On GPU, I suggest the following preconditioners:

  • Jacobi / block-Jacobi
  • Incomplete Cholesky with zero fill-in – IC(0)
  • Incomplete LU with zero fill-in – ILU(0)

All of them are available in KrylovPreconditioners.jl.
You can find some examples here of how to use kp_ic0, kp_il0 and BlockJacobiPreconditioner on GPUs.

3 Likes

Thank you very much for your help. I will try it