Hi!

I’m struggling to implement the following CUDA kernel:

```
using CUDA
using LinearAlgebra
function kernel!(A, B, C)
X = CUDA.size(A)[3]
Y = CUDA.size(B)[3]
for j in 1:X
for i in 1:Y
@inbounds C[:,:,j] += B[:,:,i] * A[:,:,j]
end
end
return
end
function benchmark_gpu(A, B, C)
CUDA.@sync begin
@cuda threads=256 kernel!(A, B, C)
end
return nothing
end
A = CUDA.rand(ComplexF64,4,4,1000)
B = CUDA.rand(ComplexF64,4,4,10)
C = CUDA.zeros(ComplexF64,4,4,1000)
benchmark_gpu(A, B, C)
```

When I run it as a script, it throws the following error:

```
ERROR: LoadError: InvalidIRError: compiling kernel #kernel!(CuDeviceArray{ComplexF64, 3, 1}, CuDeviceArray{ComplexF64, 3, 1}, CuDeviceArray{ComplexF64, 3, 1}) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to ijl_rethrow)
Stacktrace:
[1] rethrow
@ ./error.jl:61
[2] task_local_storage
@ ./task.jl:294
[3] allowscalar
@ ~/.julia/packages/GPUArraysCore/B3xv7/src/GPUArraysCore.jl:58
[4] kernel!
@ ~/cuda_tests/discourse.jl:7
....
```

At first I thought that the error was cause by broadcasting during the matrix multiplication step. However, I can do the operation `C[:,:,j] += B[:,:,i] * A[:,:,j]`

when I’m inside a Julia REPL session.

How’s that possible?

I’m aware that scalar indexing is not the best practice when it comes to maximize performance on the GPU, but I would like to understand why the code is failing anyway.