CUDAnative: Using second and third dims in the kernel

In my attempts to generalize the example code given in the README.md I have had good results, including processing arrays larger than the device block size, but I have hit a roadblock regarding the use of the second and third dimensions in the kernel.

I believe the code as presented processes the 3x4 array of floats as a 1x12 vector, with the transfer back to an array somehow accomplished in the background. I have a few questions: is it wise to ensure that the thread index does not exceed the dimensions of the known working area, or is that taken care of transparently? My attempts to pass a tuple for the grid dimensions in @cuda(…) were successful, but I am not getting sensible results from those launches. Would it be helpful to include an example that uses at least the y component of the {x,y,z} set?

I can post some code if there is interest.


OK, I think I have answered my own question and have it working, thanks.
However, the puzzling thing is that in the kernel I can write the indexing two ways, with the x dimension alone or with both x and y, and both produce a result that passes the test. I'll get it eventually.
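For reference, here is a minimal sketch of the two variants I mean (the kernel names and the doubling operation are just illustrative, and the `@cuda (grid, block)` launch form follows the older CUDAnative syntax):

```julia
using CUDAnative, CUDAdrv

# Variant 1: treat the 3x4 array as a flat 12-element vector.
# Julia arrays are column-major and support linear indexing,
# so a single x index walks the whole array.
function kernel_x(a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        b[i] = 2f0 * a[i]
    end
    return nothing
end

# Variant 2: index rows via x and columns via y.
function kernel_xy(a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(a, 1) && j <= size(a, 2)
        b[i, j] = 2f0 * a[i, j]
    end
    return nothing
end

d_a = CuArray(rand(Float32, 3, 4))
d_b = similar(d_a)

@cuda (1, 12) kernel_x(d_a, d_b)        # one block of 12 threads
@cuda (1, (3, 4)) kernel_xy(d_a, d_b)   # one block of 3x4 threads
```

Both pass the test because each launch maps its threads onto the same 12 elements, just through different index arithmetic.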

It depends on the kernel's preconditions. If you know that the index calculation will never yield an out-of-bounds index, you don't need a guard. But when generalizing for larger arrays, that might not be possible (e.g. 513 items on a max-512-threads device means launching 2 blocks of 512 threads, so all but one thread in the second block would index out of bounds without a check).
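A minimal sketch of such a bounds guard for that 513-item case (illustrative names, older `@cuda (grid, block)` launch syntax):

```julia
using CUDAnative, CUDAdrv

function kernel_guarded(a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # 2 blocks x 512 threads = 1024 threads for 513 items:
    # without this check, threads with i > 513 would index out of bounds.
    if i <= length(a)
        b[i] = 2f0 * a[i]
    end
    return nothing
end

d_a = CuArray(rand(Float32, 513))
d_b = similar(d_a)

@cuda (2, 512) kernel_guarded(d_a, d_b)
```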

There’s a lot of existing literature on how to generalize kernels flexibly, e.g. writing grid-stride loops. You should also take care whether to launch more threads or more blocks: it determines occupancy and consequently performance, but also depends on the kernel and the hardware.
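A grid-stride loop decouples the launch configuration from the problem size: each thread handles every `stride`-th element, so any grid covers any array length. A minimal sketch (illustrative names):

```julia
using CUDAnative, CUDAdrv

function kernel_gridstride(a, b)
    start  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    # each thread processes elements start, start+stride, start+2*stride, ...
    for i in start:stride:length(a)
        b[i] = 2f0 * a[i]
    end
    return nothing
end

d_a = CuArray(rand(Float32, 10_000))
d_b = similar(d_a)

# any (blocks, threads) combination covers all 10_000 elements
@cuda (4, 256) kernel_gridstride(d_a, d_b)
```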

You can use @cuprintf to debug your index calculations; see this example.
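For instance, a sketch along these lines, printing each thread's computed global index (the Int32 conversions are there to match the %d format specifier):

```julia
using CUDAnative

function kernel_debug()
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @cuprintf("block %d, thread %d -> global index %d\n",
              Int32(blockIdx().x), Int32(threadIdx().x), Int32(i))
    return nothing
end

@cuda (2, 4) kernel_debug()
```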
