The compiler doesn’t matter here; it’s your code that’s slow. We don’t do the kind of optimizations you seem to expect (changing or optimizing memory access patterns).
To optimize this, you’ll need to look into the GPU’s architectural details and avoid costly operations. For example, you are hitting global memory here all the time, both loading from and writing to it through `in` and `out`. You could restructure your kernel to buffer this data in local memory (shared memory, in CUDA terms). You are also accessing global memory in a random pattern, which defeats memory coalescing. Again, if you buffer locally and read/write global memory in consecutive chunks, different memory transactions might be able to get merged. See for example “CuArray is Row Major or Column Major?”.
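To make the coalescing point concrete, here is a toy model in plain Python (not GPU code, and not your kernel). It counts how many memory segments a single warp touches when its threads read consecutive elements versus scattered ones. The warp size of 32, the 4-byte element size, and the 128-byte segment size are assumptions I picked for illustration, and the helper names are made up; real hardware behaviour is more nuanced.

```python
# Toy model: count how many memory segments one warp-wide load touches.
# Assumptions (for illustration only): 32 threads per warp, 4-byte elements,
# 128-byte segments. Real GPUs are more nuanced than this.
import random

WARP_SIZE = 32
ELEM_BYTES = 4
SEGMENT_BYTES = 128


def segments_touched(indices):
    """Number of distinct segments hit when a warp loads these element indices."""
    return len({(i * ELEM_BYTES) // SEGMENT_BYTES for i in indices})


def consecutive_indices(warp_id):
    # Thread t of the warp reads element warp_id * WARP_SIZE + t.
    base = warp_id * WARP_SIZE
    return [base + t for t in range(WARP_SIZE)]


def scattered_indices(n_elements, rng):
    # Each thread reads an arbitrary element, as in a random gather.
    return [rng.randrange(n_elements) for _ in range(WARP_SIZE)]


rng = random.Random(0)
coalesced = segments_touched(consecutive_indices(warp_id=3))
scattered = segments_touched(scattered_indices(1 << 20, rng))
print("consecutive access:", coalesced, "segment(s)")
print("scattered access:  ", scattered, "segment(s)")
```

In this model the consecutive pattern needs a single segment per warp, while the scattered pattern needs close to one segment per thread. That is roughly the gap you would be closing by staging data locally and touching global memory in consecutive chunks.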
I suggest you look into optimizing bandwidth-bound CUDA kernels (which yours seems to be) and get familiar with the necessary tools (notably nvprof/nsight-compute).
I’m not sure what you mean by this. Does an equivalent CUDA C kernel perform better? That would be the only fair comparison.
And FWIW, please try to post runnable snippets. It’s much harder to help if we’re just brainstorming based on pseudocode.