How can I fix this function?
Also, how can I fix the commented-out line #@atomic counts[indices[idx]] += 1? It doesn't compile at all.

function count_indices(indices::CuArray{Int64,1}, maxSize::Int64)
    # Initialize a CuArray of zeros with size maxSize
    counts = CUDA.zeros(Int64, maxSize)
    # Define the kernel function
    function kernel(indices, counts)
        idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if idx <= length(indices)
            #@atomic counts[indices[idx]] += 1
            CUDA.atomic_add!(counts, indices[idx], 1)
        end
        return
    end
    # Launch the kernel
    threads = 256
    blocks = cld(length(indices), threads)
    @cuda threads=threads blocks=blocks kernel(indices, counts)
    return counts
end
count_indices(CuArray([1, 2, 3, 1, 3, 3, 3, 1, 1, 1, 1]), 4)

This is the error I get in Julia v1.9.1:
ERROR: InvalidIRError: compiling MethodInstance for (::var"kernel#7")(::CuDeviceVector{Int64, 1}, ::CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_add!)

If you are going to call this function a lot, preallocate counts outside it and write an in-place function, count_indices!. You can then still define the allocating count_indices as a thin wrapper that creates counts and calls count_indices!.
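A minimal sketch of that split, assuming the increment is done with CUDA.@atomic and that every index lies in 1:length(counts). The kernel name count_kernel! is made up for this example, and the max(1, ...) guard for empty input is my own addition:

```julia
using CUDA

# Kernel: one thread per entry of `indices`, atomically bumping the matching counter.
function count_kernel!(counts, indices)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(indices)
        CUDA.@atomic counts[indices[idx]] += 1
    end
    return
end

# In-place version: the caller owns `counts` and can reuse it across calls.
function count_indices!(counts::CuVector{Int64}, indices::CuVector{Int64})
    fill!(counts, 0)                               # reset the counts from any previous call
    threads = 256
    blocks = max(1, cld(length(indices), threads)) # avoid launching with zero blocks
    @cuda threads=threads blocks=blocks count_kernel!(counts, indices)
    return counts
end

# Allocating convenience wrapper that does everything at once.
count_indices(indices::CuVector{Int64}, maxSize::Integer) =
    count_indices!(CUDA.zeros(Int64, maxSize), indices)
```

With the input from the question, Array(count_indices(CuArray([1, 2, 3, 1, 3, 3, 3, 1, 1, 1, 1]), 4)) should give [6, 1, 4, 0].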

FWIW, the underlying issue was probably that atomic_add! (and all the other low-level atomic intrinsics) are really strict about which argument types they accept, while CUDA.@atomic performs automatic conversions.
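To make the difference concrete, here is a hedged sketch of the same increment written both ways. It assumes, as far as I understand the CUDA.jl API, that the low-level intrinsic takes a device pointer plus a value of exactly the element type, i.e. CUDA.atomic_add!(ptr, val), rather than an array and an index. The kernel names are made up for this example:

```julia
using CUDA

# High-level macro: takes an array-indexing expression and converts the value as needed.
function kernel_macro(indices, counts)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(indices)
        CUDA.@atomic counts[indices[idx]] += 1
    end
    return
end

# Low-level intrinsic: needs a pointer to the target element and a value
# whose type matches the element type exactly.
function kernel_intrinsic(indices, counts)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(indices)
        CUDA.atomic_add!(pointer(counts, indices[idx]), one(eltype(counts)))
    end
    return
end

indices = CuArray([1, 2, 3, 1, 3, 3, 3, 1, 1, 1, 1])
counts_a = CUDA.zeros(Int64, 4)
counts_b = CUDA.zeros(Int64, 4)
@cuda threads=256 blocks=1 kernel_macro(indices, counts_a)
@cuda threads=256 blocks=1 kernel_intrinsic(indices, counts_b)
```

Both launches should leave the same counts on the device. The macro form is usually the more convenient one because it does not force you to build the pointer or match the value type by hand.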

I was doing the same thing today. It seems that GPUs in general cannot atomically add to multiple values at once.

If you have a vector X of SVectors (here with three Float32 components each), you can use reshape(reinterpret(Float32, X), 3, length(X)) to get a 3 × length(X) matrix of scalars suitable for componentwise atomic operations.
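A small CPU-only sketch of what that reshape produces. NTuple{3,Float32} stands in for SVector{3,Float32} here (an SVector wraps exactly such a tuple), so the example needs nothing beyond Base Julia. The variable names X and M are just for illustration:

```julia
# Two 3-component elements; NTuple{3,Float32} is a stand-in for SVector{3,Float32}.
X = [(1f0, 2f0, 3f0), (4f0, 5f0, 6f0)]

# View the same memory as a 3 × length(X) Float32 matrix: column j holds element j.
M = reshape(reinterpret(Float32, X), 3, length(X))

# A scalar write through the view updates a single component of X.
M[2, 1] += 10f0
```

After this, X[1] is (1f0, 12f0, 3f0). On the GPU the idea would be the same: apply the reshape to the device array and have the kernel update one scalar entry M[c, j] per atomic operation, for example with CUDA.@atomic.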