Being able to write your own high-performance kernels, rather than being limited to a small menu of “vectorized built-ins”, is first-class support.
There may be a way to do what you want using built in CUDA.jl functions, of course; I’m not an expert in that package. But in the long run writing custom loops will give maximum flexibility and performance.
Before going any further, it’s necessary to know what behaviour you want out of this operation. Currently, the way you’re generating indices means that there will be duplicates. I presume you don’t want multiple values from the source array overwriting each other in the destination array, so what is supposed to happen when there are duplicates? The answer to this will dictate what functions you can use and whether you’ll need to write a custom kernel.
For now, I just want to perform the calculation shown in the top of the post.
You are right, that there is a multiple of the same calculation due to the nature of the provided example, but in my code I am debugging I_ and J_ will hold unique combinations only. So what is provided is just an example.
Ideally, I just want to understand why I see allocations using CUDA.@allocated, when it is a in-place operation, because using @CUDA.time, I don’t see allocs. So I am just confused
So when I used the approach you quote with a view, then for some reason all allocations were instead stored on CPU side, which meant that the code became much slower. I found this through use of @CUDA.time.
That was not part of the original question of course, but my question right now is;
Why is it slower when I use a view with “0” allocations? Is it due to the CPU taking over for some reason?
Perhaps I just start looking into learning how to make my own kernels and see if I understand better after
I’m not sure if the CUDA.jl docs mention this, but a good practice is to always call CUDA.allowscalar(false) before working with GPU code so that these accesses become errors instead of silently slowing down your code (assuming you didn’t see the scalar indexing warning).
Back on topic, the underlying reason for the slowness is that CUDA.jl doesn’t support many types of non-contiguous views. Because your I_ and J_ are not the same in the MWE and you do some operation (-) before writing to WGL, there’s not much you can do other than write your own kernel here. If any of those requirements don’t hold in your actual task, then it’s possible there’s a solution which wouldn’t require any custom kernels.