Is the above code optimal?
Well, you can just benchmark it, right?
Given this slightly modified example, which allows me to quickly shift indices around (within the 1024-thread limit of my GPU):
using CUDA

function kernel(x)
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[a, b, c] = x[a, b, c] + 1
    return
end

dx = CuArray{Float32}(undef, (32, 32, 32))
@cuda blocks=(32, 32) threads=32 kernel(dx)
Now let’s profile this:
$ nvprof julia wip.jl
==15727== NVPROF is profiling process 15727, command: julia wip.jl
==15727== Profiling application: julia wip.jl
==15727== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 7.3280us 1 7.3280us 7.3280us 7.3280us ptxcall_kernel_63191
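You can also time the kernel from within Julia instead of using nvprof; a minimal sketch, assuming the CUDA.jl package and reusing the kernel and dx defined above (CUDA.@elapsed uses GPU events and returns the elapsed time in seconds):

using CUDA

# Time a single launch of the kernel defined above.
# CUDA.@elapsed records events around the launch and synchronizes,
# so the result reflects actual GPU execution time.
t = CUDA.@elapsed @cuda blocks=(32, 32) threads=32 kernel(dx)
println("kernel time: ", t, " s")

For stable numbers, run the kernel once first to exclude compilation time, and average over many launches.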
Now swap the indexing around and benchmark again. You'll see that the situation where the innermost index is the fastest-evolving one triggers global memory coalescing, as I described in detail in CuArray is Row Major or Column Major? - #2 by maleadt
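For reference, a sketch of the swapped variant (kernel_swapped is my name, not from the thread): since Julia arrays are column-major, the first index is contiguous in memory, so putting threadIdx on the last dimension makes consecutive threads access locations 32×32 floats apart and defeats coalescing:

function kernel_swapped(x)
    a = blockIdx().x
    b = blockIdx().y
    c = threadIdx().x
    # Consecutive threads (c) now index the outermost dimension,
    # so their memory accesses are strided rather than coalesced.
    x[a, b, c] = x[a, b, c] + 1
    return
end

@cuda blocks=(32, 32) threads=32 kernel_swapped(dx)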
How much of a performance gain/hit will my decision cause?
On my GPU (a first-generation Titan), there's a 2x penalty for indexing inefficiently. Of course, in the presence of other operations, and at higher occupancy, this penalty might be significantly lower.
On an unrelated note, please mark threads as "solved" if they've answered your questions; it makes it easier to maintain an overview of the GPU category.