Question about CUDA kernels

Hi!

I’m struggling to implement the following CUDA kernel:

using CUDA
using LinearAlgebra

function kernel!(A, B, C)
    X = CUDA.size(A)[3]
    Y = CUDA.size(B)[3]
    for j in 1:X
        for i in 1:Y
            @inbounds C[:,:,j] += B[:,:,i]  * A[:,:,j]
        end
    end
    return 
end

function benchmark_gpu(A, B, C)
    CUDA.@sync begin
        @cuda threads=256 kernel!(A, B, C)
    end
    return nothing
end

A = CUDA.rand(ComplexF64,4,4,1000)
B = CUDA.rand(ComplexF64,4,4,10)
C = CUDA.zeros(ComplexF64,4,4,1000)

benchmark_gpu(A, B, C)

When I run it as a script, it throws the following error:

ERROR: LoadError: InvalidIRError: compiling kernel #kernel!(CuDeviceArray{ComplexF64, 3, 1}, CuDeviceArray{ComplexF64, 3, 1}, CuDeviceArray{ComplexF64, 3, 1}) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to ijl_rethrow)
Stacktrace:
 [1] rethrow
   @ ./error.jl:61
 [2] task_local_storage
   @ ./task.jl:294
 [3] allowscalar
   @ ~/.julia/packages/GPUArraysCore/B3xv7/src/GPUArraysCore.jl:58
 [4] kernel!
   @ ~/cuda_tests/discourse.jl:7
....

At first I thought that the error was caused by broadcasting during the matrix multiplication step. However, I can run the operation C[:,:,j] += B[:,:,i] * A[:,:,j] without any error from a Julia REPL session.
How is that possible?

I’m aware that scalar indexing is not the best practice when it comes to maximizing performance on the GPU, but I would like to understand why the code is failing anyway.

Code in your REPL executes in a CPU environment (even though it may in turn call into the GPU), whereas kernel code executes directly on the GPU. Only much simpler operations are possible there, not the matrix multiplication or broadcast you’re doing: inside a kernel you index scalars and write plain loops, as in the sketch below.
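
For reference, here is a minimal sketch of a kernel that sticks to the scalar operations a device function can perform, assuming one thread per output slice j; kernel_scalar! and the launch configuration are just illustrative, not the only way to write it:

using CUDA

function kernel_scalar!(C, A, B)
    # one thread per output slice j
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if j <= size(A, 3)
        for i in 1:size(B, 3)                      # accumulate over the B slices
            for y in 1:size(C, 2), x in 1:size(C, 1)
                acc = zero(eltype(C))
                for z in 1:size(A, 1)              # scalar dot product
                    @inbounds acc += B[x, z, i] * A[z, y, j]
                end
                @inbounds C[x, y, j] += acc
            end
        end
    end
    return nothing
end

# launch with enough blocks to cover all slices:
# CUDA.@sync @cuda threads=256 blocks=cld(size(A, 3), 256) kernel_scalar!(C, A, B)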

If this operation is what you want, it’s easy to do without writing kernels. It is C[x,y,j] = B[x,z,i] * A[z,y,j] summed over z and i, and the sum over i can be done first:

using TensorCore  # reshape + one matrix multiplication
boxdot!(C, dropdims(sum(B, dims=3), dims=3), A)

using NNlib  # CUDA's gemm_strided_batched!
batched_mul!(C, sum(B, dims=3), A)
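
If you want to convince yourself that this matches your loop, a quick check against a plain CPU reference could look like this (Ah, Bh, Cref and C2 are just names for this sketch):

using CUDA, NNlib

Ah, Bh = Array(A), Array(B)
Cref = zeros(ComplexF64, 4, 4, 1000)
for j in axes(Ah, 3), i in axes(Bh, 3)
    Cref[:, :, j] += Bh[:, :, i] * Ah[:, :, j]   # the original loop, on the CPU
end

C2 = CUDA.zeros(ComplexF64, 4, 4, 1000)
batched_mul!(C2, sum(B, dims=3), A)   # batch dimension of size 1 is broadcast over A's slices
Array(C2) ≈ Cref                      # should be true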

Oh I get it. Thank you for the answer!!

That does exactly what I wanted.
Thank you very much!!

However, in order to make the NNlib option work on CuArrays, I needed to load the CUDA backend package NNlibCUDA as well.
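
In case it helps others, this is the loading I ended up with. Note that this depends on the NNlib version; if I understand correctly, newer NNlib releases ship the CUDA backend as a package extension that activates when CUDA is loaded, so NNlibCUDA may not be needed there:

using CUDA
using NNlib, NNlibCUDA   # adds the CuArray methods so batched_mul! hits the batched GEMM path

batched_mul!(C, sum(B, dims=3), A)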