Welcome to the Julia community!
Please read PSA: how to quote code with backticks to improve your formatting.
I’ve tried to run your code with `CUDABackend()`. Apart from changing a `ROCArray` into a `CuArray`, I also had to
- change `temp = @localmem(Int32, group_sz)` into `temp = @localmem(Int32, GROUP_SIZE)`
- remove the `wait(total_found)`
- add `KADevice` to `output_indices = KernelAbstractions.zeros(Int32, count)`.
Then I get `Int32[3, 4, 6, 8]`, which I assume is the desired output.
Based on the documentation, I cannot really tell what the point of `@private` is. Coming from CUDA.jl, I don’t see why you couldn’t just write `local_d = 1`. And indeed you can: the `CUDABackend()` code runs perfectly fine in this manner. But it does seem important when using `CPU()` as the backend.
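To illustrate, here is a minimal sketch (the kernel and its names are mine, not from your code) showing that a plain local variable, without `@private`, works fine even on `CPU()`, assuming the same construction-style launch API used elsewhere in this thread:

```julia
using KernelAbstractions

# Hypothetical minimal kernel: a plain local assignment and
# reassignment, no @private involved.
@kernel function doubler!(out)
    i = @index(Global)
    local_d = 1
    local_d *= 2          # plain reassignment; no NTuple machinery
    out[i] = local_d
end

out = zeros(Int32, 4)
doubler!(CPU(), 4, 4)(out)  # workgroupsize = 4, static ndrange = 4
# expect out == Int32[2, 2, 2, 2]
```

So the breakage appears specific to reassigning a variable declared with `@private` on the CPU backend.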
The issue when using `@private local_d = 1` turns out to be in the `local_d *= 2` line. Seemingly, `local_d` across threads is represented as an `NTuple{256, Int64}` (with `256 == @groupsize()`), and things start to break down after the (attempted) reassignment. So an MWE for the issue is
```julia
julia> @kernel function kern()
           @private var = 1
           var *= 2
       end

julia> kern(CPU(), 1, 1)()
ERROR: MethodError: no method matching setindex!(::Tuple{Int64}, ::Int64, ::Int64)
(...)
```
This looks like a bug to me.