Assume I have the following in Julia, which I believe should be fine because Julia arrays are column major, so the innermost loop runs over the first (fastest-varying) index.
x = Array{Float32, 3}(undef, 500, 20, 200)
# Initialize x
for c = 1:200
    for b = 1:20
        for a = 1:500
            x[a, b, c] = x[a, b, c] + 1
        end
    end
end
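For reference, in current Julia the whole update collapses to a single in-place broadcast; I only spelled out the loops to make the index order explicit:

x .+= 1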
Now, if I want to write this using CUDAdrv and CUDAnative, I might write something like:
function kernel(x)
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[a, b, c] = x[a, b, c] + 1
    return  # kernels must return nothing
end

@cuda blocks=(200, 20) threads=500 kernel(x)
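For completeness, here is the minimal host-side sketch I have around the kernel (the d_x name and the zeros initialization are my own; I am assuming CuArray from CUDAdrv handles the transfers):

using CUDAdrv, CUDAnative

x = zeros(Float32, 500, 20, 200)
d_x = CuArray(x)                               # upload to device memory
@cuda blocks=(200, 20) threads=500 kernel(d_x)
x = Array(d_x)                                 # download the result back to the host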
Is the above code optimal with regard to:
- how I map a, b, and c onto blocks and threads? Is this the typical way to do it?
- how I index the array? Should I keep the same order as in Julia, or reverse the indices as x[c, b, a], and would that be faster? (See the sketch after this list.)
Am I overthinking this? How much of a performance gain or hit will this decision cause?
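In case it helps, this is roughly how I planned to compare the two layouts (a rough sketch; kernel_rev, the reversed (200, 20, 500) allocation, and the timing via synchronize are my own guesses, not measured results):

# Reversed layout for comparison: threadIdx().x now walks the last index,
# so consecutive threads in a warp touch elements 200*20 apart in memory.
function kernel_rev(x)
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[c, b, a] = x[c, b, a] + 1
    return
end

d_x   = CuArray(zeros(Float32, 500, 20, 200))   # a is the fastest-varying index
d_rev = CuArray(zeros(Float32, 200, 20, 500))   # a is the slowest-varying index

# Warm-up launches to exclude compilation time, then time with a device sync.
@cuda blocks=(200, 20) threads=500 kernel(d_x)
@cuda blocks=(200, 20) threads=500 kernel_rev(d_rev)
CUDAdrv.synchronize()

t_abc = Base.@elapsed begin
    @cuda blocks=(200, 20) threads=500 kernel(d_x)
    CUDAdrv.synchronize()
end
t_cba = Base.@elapsed begin
    @cuda blocks=(200, 20) threads=500 kernel_rev(d_rev)
    CUDAdrv.synchronize()
end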