Is the above code optimal?
Well, you can just benchmark it, right?
Given this slightly modified example, which allows me to quickly shift indices around (within the 1024-thread limit of my GPU):
using CUDA

function kernel(x)
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[a, b, c] = x[a, b, c] + 1
    return
end

dx = CuArray{Float32}(undef, (32, 32, 32))
@cuda blocks=(32, 32) threads=32 kernel(dx)
Now let’s profile this:
$ nvprof julia wip.jl
==15727== NVPROF is profiling process 15727, command: julia wip.jl
==15727== Profiling application: julia wip.jl
==15727== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 7.3280us 1 7.3280us 7.3280us 7.3280us ptxcall_kernel_63191
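You can also time the kernel from within Julia instead of using nvprof; a minimal sketch, assuming the CUDA.jl package and reusing the kernel and dx defined above (CUDA.@elapsed uses GPU events and returns the elapsed time in seconds):

using CUDA

# Time a single launch of the kernel defined above.
# CUDA.@elapsed records events around the launch and synchronizes,
# so the result reflects actual GPU execution time.
t = CUDA.@elapsed @cuda blocks=(32, 32) threads=32 kernel(dx)
println("kernel time: ", t, " s")

For stable numbers, run the kernel once first to exclude compilation time, and average over many launches.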
Now swap the indexing around and benchmark again. You'll see that the situation where the innermost index is the fastest-evolving one triggers global memory coalescing, as I described in detail in CuArray is Row Major or Column Major? - #2 by maleadt
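For reference, a sketch of the swapped variant (kernel_swapped is my name, not from the thread): since Julia arrays are column-major, the first index is contiguous in memory, so putting threadIdx on the last dimension makes consecutive threads access locations 32×32 floats apart and defeats coalescing:

function kernel_swapped(x)
    a = blockIdx().x
    b = blockIdx().y
    c = threadIdx().x
    # Consecutive threads (c) now index the outermost dimension,
    # so their memory accesses are strided rather than coalesced.
    x[a, b, c] = x[a, b, c] + 1
    return
end

@cuda blocks=(32, 32) threads=32 kernel_swapped(dx)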
How much of a performance gain/hit will my decision cause?
On my GPU (a first-generation Titan), there's a 2x penalty for indexing inefficiently. Of course, in the presence of other operations, and at higher occupancy, this penalty might be significantly lower.
On an unrelated note, please mark threads as "solved" if they've answered your questions; it makes it easier to maintain an overview of the GPU category.