I am playing with Query.jl and it works fine in CPU code. I even benchmarked a few use cases, and it ran faster than the equivalent “classic” for-loop code.
What I don’t know is whether it has any support for CUDA, or whether there are any guidelines for using it there, since I see its biggest use in kernel code. Does anyone have experience using Query.jl in CUDA kernel code?
To be specific, I would like to filter the loop that iterates across the threads, for example something like this:
    for i in index:stride:l |> @filter(_ != some_id)
        ...
    end
Instead of:
    for i in index:stride:l
        if i == some_id
            continue
        end
        ...
    end