I have a toy example where the goal is to process a 3x3 matrix of integers on the GPU and double each element. I have no problem doing this when I let CUDAnative linearize the array into a vector, but I am puzzled by what happens when I try to index the array as a 3x3 matrix on the GPU. Here is my toy example, which produces the right answer but for the wrong reason.
using CUDAdrv, CUDAnative
function kernel_mmul(a, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    j = (blockIdx().y-1) * blockDim().y + threadIdx().y
    c[i,j] = a[i,j] * 2
    @cuprintf(" %d %d %d %d\n",i,j,c[i,j],threadIdx().y)
    return nothing
end
dev = CuDevice(0)
ctx = CuContext(dev)
a = Int32[1 2 3; 2 3 1; 3 1 2]
d_a = CuArray(a)
d_c = similar(d_a)
@cuda ((1,1),(3,3)) kernel_mmul(d_a, d_c)
c = Array(d_c)
println(a)
println(c)
destroy(ctx)
For some reason the index j in the kernel is always zero, so my guess is that multiple blocks of i are being processed to produce the answer. The number of iterations is correct every time, and the result is correct as long as the run does not end in an error caused by a poor combination of grid and block dimensions. Of note, blockDim also seems to be zero, which is counterintuitive.
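For reference, here is a minimal CPU-only sketch of the indices I would expect the kernel to compute. It assumes the first tuple passed to @cuda is the grid size and the second is the block size, and that threadIdx() and blockIdx() are 1-based. The helper expected_indices is hypothetical (it is not part of CUDAnative) and only replays the index arithmetic from kernel_mmul in ordinary loops.

```julia
# CPU-only sketch: replays the index arithmetic of kernel_mmul without a GPU.
# `expected_indices` is a hypothetical helper, not a CUDAnative function.
# Assumption: griddim and blockdim correspond to the two tuples given to @cuda.
function expected_indices(griddim::NTuple{2,Int}, blockdim::NTuple{2,Int})
    idx = Tuple{Int,Int}[]
    for by in 1:griddim[2], bx in 1:griddim[1]        # stands in for blockIdx().y / .x
        for ty in 1:blockdim[2], tx in 1:blockdim[1]  # stands in for threadIdx().y / .x
            i = (bx - 1) * blockdim[1] + tx
            j = (by - 1) * blockdim[2] + ty
            push!(idx, (i, j))
        end
    end
    return idx
end

a = Int32[1 2 3; 2 3 1; 3 1 2]
c = similar(a)

# Same launch shape as in the question: one block of 3x3 threads.
idx = expected_indices((1, 1), (3, 3))
for (i, j) in idx
    c[i, j] = a[i, j] * 2
end

println(idx)
println(c)
```

Under these assumptions every (i, j) pair from (1, 1) to (3, 3) appears exactly once, so j should range over 1 to 3 rather than stay at zero.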