CUDAnative: kernel multidimensional access



I have a toy example where the goal is to process a 3x3 matrix of integers on the GPU and double each element. I have no problem doing this when I let CUDAnative linearize the array to a vector, but processing the array as a 3x3 on the GPU puzzles me. Here is my toy example, which produces the right answer but for the wrong reason.

using CUDAdrv, CUDAnative

function kernel_mmul(a, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    j = (blockIdx().y-1) * blockDim().y + threadIdx().y
    c[i,j] = a[i,j].*2
    @cuprintf(" %d %d %d %d\n",i,j,c[i,j],threadIdx().y)
    return nothing
end

dev = CuDevice(0)
ctx = CuContext(dev)
a = Int32[1 2 3; 2 3 1; 3 1 2]
d_a = CuArray(a)
d_c = similar(d_a) 
@cuda ((1,1),(3,3)) kernel_mmul(d_a, d_c)
c = Array(d_c)

For some reason the index j printed by the kernel is always zero, so I guess multiple blocks of i are processed to get the answer. The number of printed iterations is correct each time, and the result is correct as long as the run does not end in an error due to a poor choice of grid and block combination. Of note is that blockDim seems to be zero, which is counterintuitive.


Your index calculation is correct, but subtracting the literal 1 (an Int64 on 64-bit systems) promotes the result to Int64, so the %d format specifier no longer matches the argument. Either use %ld, or subtract Int32(1) instead.
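The promotion can be checked on the CPU without a GPU. This is a minimal sketch: the variables block_idx, block_dim and thread_idx are hypothetical stand-ins for the values returned by blockIdx().x, blockDim().x and threadIdx().x, and they are assumed to be Int32 here purely for illustration.

```julia
# Hypothetical stand-ins for the GPU intrinsics; Int32 is an assumption for illustration.
block_idx  = Int32(1)
block_dim  = Int32(3)
thread_idx = Int32(2)

# The literal 1 is an Int (Int64 on 64-bit systems), so the whole expression is promoted.
i_promoted = (block_idx - 1) * block_dim + thread_idx

# Subtracting Int32(1) keeps every operand, and therefore the result, 32-bit.
i_32bit = (block_idx - Int32(1)) * block_dim + thread_idx

println(typeof(i_promoted))  # Int64 on a 64-bit system
println(typeof(i_32bit))     # Int32
```

Both expressions yield the same index value; only the type differs, and the type is what the format specifier has to match.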

Relevant issue:


Wow! You are right. The format string " %d %ld %d %d\n" works. So it might be informative that the variable “i” seems indifferent to whether it is formatted as Int32 or Int64, while the variable “j” is very sensitive to that choice, even though both use the exact same value and type for the -1 component.


Huh, curious. You should be using %ld for both, though. Or even better, keep i and j 32-bit (although it doesn’t matter much in this case).
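One possible explanation for why i tolerates %d while j does not can be sketched on the CPU. The model below is an assumption, not a description of the actual device-side printf implementation: it supposes the arguments are packed into a little-endian byte buffer at their natural alignment, and that each specifier reads 4 bytes (%d) or 8 aligned bytes (%ld) from that buffer. The helper read_at and the sample values are hypothetical.

```julia
# Assumed model: printf arguments packed into a byte buffer, little-endian.
i, j   = Int64(2), Int64(3)   # indices promoted to Int64 by the "- 1"
c, tid = Int32(6), Int32(3)   # 32-bit element value and thread index

buf = UInt8[]
append!(buf, reinterpret(UInt8, [i]))    # bytes 0-7
append!(buf, reinterpret(UInt8, [j]))    # bytes 8-15
append!(buf, reinterpret(UInt8, [c]))    # bytes 16-19
append!(buf, reinterpret(UInt8, [tid]))  # bytes 20-23

# Hypothetical helper: read one value of type T at a zero-based byte offset.
read_at(T, buf, offset) = reinterpret(T, buf[offset+1:offset+sizeof(T)])[1]

# " %d %d %d %d": four consecutive 4-byte reads.
all_d = [read_at(Int32, buf, o) for o in (0, 4, 8, 12)]
println(all_d)   # Int32[2, 0, 3, 0]

# " %d %ld %d %d": the %ld read is aligned to the next 8-byte boundary.
mixed = (read_at(Int32, buf, 0), read_at(Int64, buf, 8),
         read_at(Int32, buf, 16), read_at(Int32, buf, 20))
println(mixed)   # (2, 3, 6, 3)
```

Under this model the first %d happens to read the low half of i, which is why i looks correct either way, while the second %d lands on the zero high half of i instead of j. It would also explain a trailing field printing as zero.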