Solve linear systems inside CUDA kernel function

FR13ndSDP · February 14, 2024, 8:52am

Hello everyone, I was wondering how to solve a small linear system Ax = b inside a CUDA kernel function, where A is an MMatrix and x and b are MVectors. I want to do this with x = A\b, but it does not work inside a kernel. Is there a solution to this problem?

ChrisRackauckas · February 14, 2024, 11:14am

If you use static arrays, it won’t be an issue. This is done in DiffEqGPU.jl.

FR13ndSDP · February 14, 2024, 11:47am

Hi Chris, unfortunately I haven’t been able to solve this problem yet. I hope this code snippet illustrates the problem:

using CUDA, StaticArrays

const N = 20

function test()
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # each thread has unique A and b
    a = @MMatrix rand(Float64, N, N)
    b = @MVector rand(Float64, N)
    c = MVector{N, Float64}(undef)

    # This works
    c = a * b

    # But this does not
    c = a \ b

    return
end

@cuda threads=10 test()

Also, is it possible to solve batches of small linear systems with CUBLAS or CUSOLVER?

Zentrik · February 14, 2024, 12:07pm

I assume you want to use an SMatrix and SVector, not an MMatrix and MVector.

FR13ndSDP · February 14, 2024, 1:47pm

In my case, A and b will be constructed inside the kernel, so they have to be mutable. Moreover, using SMatrix and SVector does not help in the snippet above.

trahflow · February 14, 2024, 1:58pm

That’s very unlikely to work. You cannot dynamically allocate memory inside a GPU kernel (see also this recent post: Modifying a thread-local vector within CUDA Dynamic Parallelism - #2 by vchuravy).

What should work, though, is to allocate all CuArrays outside the kernel, then inside the kernel convert the relevant views of those arrays into SMatrix/SVectors and do the solve on StaticArrays only. (I don’t have access to a GPU atm to check)

FR13ndSDP · February 14, 2024, 2:55pm

Allocating memory with MMatrix and MVector works fine. I think the problem is that the \ operation allocates. I’ve tried implementing Gaussian elimination to solve the linear system; it works well on the GPU, but I’m worried about its performance.
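
For readers following along, here is a minimal sketch of what such an allocation-free solver might look like. It is not the code from this post: `gauss_solve!` is a hypothetical name, and the routine is plain Gaussian elimination with partial pivoting followed by back substitution. It uses only scalar indexing and overwrites its inputs, so in principle it can be called on an MMatrix and MVector inside a kernel, but that is an assumption that has not been tested on a GPU here.

```julia
# Hypothetical helper (not from the original post): solves A*x = b in place.
# Assumes A is square (n×n) with n == length(b) and is non-singular;
# there is no singularity check, to avoid throwing inside a kernel.
# On return, A is overwritten and b holds the solution x.
function gauss_solve!(A::AbstractMatrix{T}, b::AbstractVector{T}) where {T}
    n = length(b)
    @inbounds for k in 1:n
        # Partial pivoting: pick the row with the largest |A[i, k]| for i >= k.
        p = k
        for i in k+1:n
            if abs(A[i, k]) > abs(A[p, k])
                p = i
            end
        end
        # Swap rows k and p (only columns k:n are still needed).
        if p != k
            for j in k:n
                A[k, j], A[p, j] = A[p, j], A[k, j]
            end
            b[k], b[p] = b[p], b[k]
        end
        # Eliminate the entries below the pivot.
        for i in k+1:n
            f = A[i, k] / A[k, k]
            for j in k+1:n
                A[i, j] -= f * A[k, j]
            end
            b[i] -= f * b[k]
        end
    end
    # Back substitution on the resulting upper-triangular system.
    @inbounds for i in n:-1:1
        s = b[i]
        for j in i+1:n
            s -= A[i, j] * b[j]
        end
        b[i] = s / A[i, i]
    end
    return b
end
```

Because the function is generic over AbstractMatrix and AbstractVector, it also runs on ordinary Arrays on the CPU, which makes it easy to compare its result against A \ b before using it in a kernel.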

utkarsh530 · February 14, 2024, 4:21pm

Hi, can you try for some N < 14? IIUC, there were some allocations here: StaticArrays.jl/src/solve.jl at master · JuliaArrays/StaticArrays.jl · GitHub

We can probably try to get that dispatch set up in LinearSolve.jl, but I’m not sure, as the current approach may have been chosen for performance reasons.

FR13ndSDP · February 14, 2024, 4:54pm

Yes, you are right! For N ≤ 14 it works well. But in my case, the typical size is N in the range [20, 200]. I think my implementation of Gaussian elimination will be faster than A\b if N ≤ 14.
