Optimizing the use of Blocks, Threads vs. Array Indexing

Is the above code optimal?

Well, you can just benchmark it, right?

Given this slightly modified example, which allows me to quickly shift indices around (within the 1024-thread limit of my GPU):

using CUDAnative, CuArrays  # CUDAnative-era packages, matching the @cuda syntax below

function kernel(x)
    c = blockIdx().x   # outermost (slowest-varying) dimension
    b = blockIdx().y
    a = threadIdx().x  # innermost (fastest-varying) dimension: CuArray is column-major
    x[a, b, c] = x[a, b, c] + 1
    return
end

dx = CuArray{Float32,3}(32, 32, 32)
@cuda ((32, 32), 32) kernel(dx)  # ((blocks_x, blocks_y), threads)

Now let’s profile this:

$ nvprof julia wip.jl
==15727== NVPROF is profiling process 15727, command: julia wip.jl
==15727== Profiling application: julia wip.jl
==15727== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  7.3280us         1  7.3280us  7.3280us  7.3280us  ptxcall_kernel_63191

Now swap the indexing around, and benchmark again. You'll see that the configuration where the thread index drives the innermost (fastest-varying) array dimension triggers global memory coalescing, as I described in detail in CuArray is Row Major or Column Major? - #2 by maleadt
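For example, here is a minimal sketch of one such swap (the kernel_swapped name is mine): consecutive threads now index the outermost dimension, so their accesses land 32*32 elements apart in memory and cannot coalesce:

function kernel_swapped(x)
    a = blockIdx().x   # innermost dimension now driven by the block index
    b = blockIdx().y
    c = threadIdx().x  # consecutive threads are 32*32 elements apart in memory
    x[a, b, c] = x[a, b, c] + 1
    return
end

@cuda ((32, 32), 32) kernel_swapped(dx)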

How much of a performance gain/hit will my decision cause?

On my GPU (a first-generation Titan), there's a 2× penalty for indexing inefficiently. Of course, in the presence of other operations, and with a higher occupancy, this penalty might be significantly lower.
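As a back-of-the-envelope check (the 7.33 µs figure comes from the nvprof output above; the ~288 GB/s peak is the published memory bandwidth of a first-generation Titan):

# effective bandwidth of the profiled launch
bytes = 32^3 * sizeof(Float32) * 2    # one read + one write per element
time_s = 7.328e-6                     # kernel time from nvprof above
bandwidth = bytes / time_s / 1e9      # ≈ 36 GB/s, far below the ~288 GB/s peak

At this tiny problem size the kernel is nowhere near bandwidth-bound, which is another reason the measured penalty can vary with occupancy and surrounding work.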

On an unrelated note, please mark threads as "solved" if they've answered your questions.
It makes it easier to maintain an overview of the GPU category 🙂