Question about CUDA kernels

Hi!

I’m struggling to implement the following CUDA kernel:

using CUDA
using LinearAlgebra

function kernel!(A, B, C)
    X = CUDA.size(A)[3]
    Y = CUDA.size(B)[3]
    for j in 1:X
        for i in 1:Y
            @inbounds C[:,:,j] += B[:,:,i]  * A[:,:,j]
        end
    end
    return 
end

function benchmark_gpu(A, B, C)
    CUDA.@sync begin
        @cuda threads=256 kernel!(A, B, C)
    end
    return nothing
end

A = CUDA.rand(ComplexF64,4,4,1000)
B = CUDA.rand(ComplexF64,4,4,10)
C = CUDA.zeros(ComplexF64,4,4,1000)

benchmark_gpu(A, B, C)

When I run it as a script, it throws the following error:

ERROR: LoadError: InvalidIRError: compiling kernel #kernel!(CuDeviceArray{ComplexF64, 3, 1}, CuDeviceArray{ComplexF64, 3, 1}, CuDeviceArray{ComplexF64, 3, 1}) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to ijl_rethrow)
Stacktrace:
 [1] rethrow
   @ ./error.jl:61
 [2] task_local_storage
   @ ./task.jl:294
 [3] allowscalar
   @ ~/.julia/packages/GPUArraysCore/B3xv7/src/GPUArraysCore.jl:58
 [4] kernel!
   @ ~/cuda_tests/discourse.jl:7
....

At first I thought that the error was caused by broadcasting during the matrix multiplication step. However, I can run the operation C[:,:,j] += B[:,:,i] * A[:,:,j] without any error from a Julia REPL session.
How is that possible?

I’m aware that scalar indexing is not the best practice when it comes to maximizing performance on the GPU, but I would like to understand why the code is failing anyway.

Code in your REPL executes in a CPU environment (even though it may in turn call into the GPU), whereas kernel code executes directly on the GPU. Only much simpler operations are possible there, not the matrix multiplication or broadcast you’re doing: inside a kernel you index scalars and write plain loops, as in the sketch below.
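
For reference, here is a minimal sketch of a kernel that sticks to the scalar operations a device function can perform, assuming one thread per output slice j; kernel_scalar! and the launch configuration are just illustrative, not the only way to write it:

using CUDA

function kernel_scalar!(C, A, B)
    # one thread per output slice j
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if j <= size(A, 3)
        for i in 1:size(B, 3)                      # accumulate over the B slices
            for y in 1:size(C, 2), x in 1:size(C, 1)
                acc = zero(eltype(C))
                for z in 1:size(A, 1)              # scalar dot product
                    @inbounds acc += B[x, z, i] * A[z, y, j]
                end
                @inbounds C[x, y, j] += acc
            end
        end
    end
    return nothing
end

# launch with enough blocks to cover all slices:
# CUDA.@sync @cuda threads=256 blocks=cld(size(A, 3), 256) kernel_scalar!(C, A, B)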

If this operation is what you want, it’s easy to do without writing kernels. It is C[x,y,j] = B[x,z,i] * A[z,y,j] summed over z and i, and the sum over i can be done first:

using TensorCore  # reshape + one matrix multiplication
boxdot!(C, dropdims(sum(B, dims=3), dims=3), A)

using NNlib  # CUDA's gemm_strided_batched!
batched_mul!(C, sum(B, dims=3), A)
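
If you want to convince yourself that this matches your loop, a quick check against a plain CPU reference could look like this (Ah, Bh, Cref and C2 are just names for this sketch):

using CUDA, NNlib

Ah, Bh = Array(A), Array(B)
Cref = zeros(ComplexF64, 4, 4, 1000)
for j in axes(Ah, 3), i in axes(Bh, 3)
    Cref[:, :, j] += Bh[:, :, i] * Ah[:, :, j]   # the original loop, on the CPU
end

C2 = CUDA.zeros(ComplexF64, 4, 4, 1000)
batched_mul!(C2, sum(B, dims=3), A)   # batch dimension of size 1 is broadcast over A's slices
Array(C2) ≈ Cref                      # should be true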

Oh I get it. Thank you for the answer!!

That does exactly what I wanted.
Thank you very much!!

However, in order to make the NNlib option work on CuArrays, I needed to load the CUDA backend package NNlibCUDA as well.
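
In case it helps others, this is the loading I ended up with. Note that this depends on the NNlib version; if I understand correctly, newer NNlib releases ship the CUDA backend as a package extension that activates when CUDA is loaded, so NNlibCUDA may not be needed there:

using CUDA
using NNlib, NNlibCUDA   # adds the CuArray methods so batched_mul! hits the batched GEMM path

batched_mul!(C, sum(B, dims=3), A)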