Welcome to the Julia community!
Please read PSA: how to quote code with backticks to improve the formatting of your post.
I’ve tried to run your code with `CUDABackend()`. Apart from changing a `ROCArray` into a `CuArray`, I also had to

- change `temp = @localmem(Int32, group_sz)` into `temp = @localmem(Int32, GROUP_SIZE)`
- remove the `wait(total_found)`
- add `KADevice` to `output_indices = KernelAbstractions.zeros(Int32, count)`.

Then I get `Int32[3, 4, 6, 8]`, which I assume is the desired output.
Based on the documentation, I cannot really tell what the point of `@private` is. Coming from CUDA.jl, I don’t see why you couldn’t just write `local_d = 1`. And indeed you can: the `CUDABackend()` code runs perfectly fine in this manner. But `@private` does seem to matter when using `CPU()` as the backend.
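For reference, here is a minimal sketch of what I mean by the plain-local-variable version (the kernel name `doubling` and the sizes are made up for illustration, and this assumes KernelAbstractions.jl and CUDA.jl are loaded):

```julia
using KernelAbstractions, CUDA

# A plain local variable instead of @private; reassigning it is fine
# when compiled for the GPU backend.
@kernel function doubling()
    local_d = 1
    local_d *= 2
end

# Instantiate with backend, workgroup size, and ndrange, then launch.
doubling(CUDABackend(), 256, 256)()
```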
The issue when using `@private local_d = 1` turns out to be in the `local_d *= 2` line. Seemingly, `local_d` across threads is represented as an `NTuple{256, Int64}` (with `256 == @groupsize()`), and things start to break down after the (attempted) reassignment. So an MWE for the issue is:
```julia
julia> using KernelAbstractions

julia> @kernel function kern()
           @private var = 1
           var *= 2
       end

julia> kern(CPU(), 1, 1)()
ERROR: MethodError: no method matching setindex!(::Tuple{Int64}, ::Int64, ::Int64)
(...)
```
This looks like a bug to me.
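If I’m reading the KernelAbstractions docs right, the array form of `@private` (which allocates per-item private memory you index into) should sidestep the reassignment problem, since an indexed update is an in-place `setindex!` rather than a rebinding. A sketch, untested on my end:

```julia
using KernelAbstractions

# Array form of @private: a length-1 private array per work-item,
# mutated via indexing instead of rebinding the variable.
@kernel function kern2()
    var = @private Int64 (1,)
    var[1] = 1
    var[1] *= 2
end

kern2(CPU(), 1, 1)()
```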