Optimizing the use of Blocks, Threads vs. Array Indexing

Assume that in Julia I have the following, which should be fine because arrays in Julia are column-major.

x = Array{Float32, 3}(500, 20, 200)
# Initialize x
for c = 1:200
    for b = 1:20
        for a = 1:500
            x[a, b, c] = x[a, b, c] + 1
        end
    end
end

Now, if I want to write this using CUDAdrv and CUDAnative, I might have

function kernel(x)
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[a, b, c] = x[a, b, c] + 1
    return  # kernels must not return a value
end

@cuda ((200, 20), 500) kernel(x)

Is the above code optimal with regard to

  1. how I represent a, b, c using blocks and threads: is this the typical way to do it?
  2. how I index the array the same way as in Julia, or whether I should reverse the indices as x[c, b, a]: would that be faster?

Am I overthinking this? How much of a performance gain/hit will my decision cause?

Is the above code optimal

Well, you can just benchmark it, right?

Given this slightly modified example, which lets me quickly shift the indices around (while staying within the 1024-thread limit of my GPU):

function kernel(x)
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[a, b, c] = x[a, b, c] + 1
    return
end

dx = CuArray{Float32,3}(32, 32, 32)
@cuda ((32, 32), 32) kernel(dx)

Now let’s profile this:

$ nvprof julia wip.jl
==15727== NVPROF is profiling process 15727, command: julia wip.jl
==15727== Profiling application: julia wip.jl
==15727== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  7.3280us         1  7.3280us  7.3280us  7.3280us  ptxcall_kernel_63191

Now swap the indexing around and benchmark again. You’ll see that only the situation where the innermost (first) index is the fastest-evolving one triggers global memory coalescing, as I described in detail in CuArray is Row Major or Column Major? - #2 by maleadt
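
For concreteness, the swapped variant might look like the following sketch (kernel_swapped is just an illustrative name; it reuses dx from above):

function kernel_swapped(x)
    # The first (fastest-varying) array index now comes from a block index,
    # so consecutive threads within a block no longer touch consecutive memory.
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[c, b, a] = x[c, b, a] + 1
    return
end

@cuda ((32, 32), 32) kernel_swapped(dx)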

How much of a performance gain/hit will my decision cause?

On my GPU (first generation Titan), there’s a 2x penalty by indexing inefficiently. Of course, in the presence of other operations, and with a higher occupancy, this penalty might be significantly lower.

On an unrelated note, please mark threads as "solved" if they’ve answered your questions.
Makes it easier to maintain overview of the GPU category :slight_smile:

Sorry to have to ask this. Where is the menu option to mark a thread as solved?

Little tickbox, below answers.

I made the array larger so I can see a bigger difference.

Here is Code A, where the innermost index is mapped to threadIdx().x.

using CUDAdrv, CUDAnative

function kernel(x)
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[a, b, c] = x[a, b, c] + 1
    return
end

dx = CuArray{Float32,3}(1024, 1024, 1024)
@cuda ((1024, 1024), 1024) kernel(dx)

With nvprof I get

GPU activities:  100.00%  24.763ms         1  24.763ms  24.763ms  24.763ms  ptxcall_kernel_61011

Now with Code B, where the innermost index is mapped to blockIdx().y, which I assume is the slowest-evolving one.

using CUDAdrv, CUDAnative

function kernel(x)
    c = blockIdx().y
    b = blockIdx().x
    a = threadIdx().x
    x[c, b, a] = x[c, b, a] + 1
    return
end

dx = CuArray{Float32,3}(1024, 1024, 1024)
@cuda ((1024, 1024), 1024) kernel(dx)

With nvprof I get

GPU activities:  100.00%  1.23447s         1  1.23447s  1.23447s  1.23447s  ptxcall_kernel_61011

That is a difference of around 50 times (24.763 ms vs. 1.23447 s). So I now assume the best practice is to always map the innermost index to threadIdx().x.

By innermost I mean the leftmost index, which I hope is correct.
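
As an aside, if the first dimension were larger than the 1024-thread limit, one way to keep that index on threadIdx().x is to tile it over an additional block dimension. A rough, untested sketch with illustrative names (the bounds check and the blockIdx().z tiling are my own additions, not from the posts above):

using CUDAdrv, CUDAnative

# Keep the fastest-varying (first) array index on threadIdx().x so that
# consecutive threads access consecutive memory, and tile that dimension
# over blockIdx().z when it exceeds the 1024-thread limit.
function kernel_tiled(x)
    a = (blockIdx().z - 1) * blockDim().x + threadIdx().x
    b = blockIdx().y
    c = blockIdx().x
    if a <= size(x, 1)
        x[a, b, c] = x[a, b, c] + 1
    end
    return
end

dx = CuArray{Float32,3}(4096, 32, 32)
threads = 1024
blocks = (32, 32, cld(size(dx, 1), threads))  # (c, b, tiles of a)
@cuda (blocks, threads) kernel_tiled(dx)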

I am sorry, but this is all I can see. I do not see a tick box.

I think that it only shows up if the post was tagged as a question, and there are apparently some other idiosyncratic rules; you can ask in Meta Discussion - JuliaLang.

Yes, correct. In that sense it kind of corresponds with the guidelines for array programming on the CPU, but for different reasons:

A rule of thumb to keep in mind is that with column-major arrays, the first index changes most rapidly.

From https://docs.julialang.org/en/latest/manual/performance-tips/#Access-arrays-in-memory-order,-along-columns-1

Also know that for simple time measurements you can just as well use CUDAdrv.@elapsed (keeping in mind that those measurements are less accurate: the first one includes compilation time, and all of them include the overhead of calling into the CUDA API):

# slow kernel
julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
9.108586f0

julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
2.5113032f0

julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
2.4996145f0

# fast kernel

julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
0.14859039f0

julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
0.052447107f0

julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
0.0524136f0

Oh sorry I had assumed this was possible for all types of posts.

Thank you Maleadt, with your method I was able to pinpoint the bottleneck in my code. It turns out that my implementation of “Quicksort” is taking up the vast majority of my time; I have to run this sort about 4,000 times. So I just implemented a typical “Quicksort without recursion” on CUDA and ran it in parallel.

I guess I will have to look for how to efficiently implement sort in CUDA.

Quicksort doesn’t sound like it would be GPU-friendly; did you try a parallel sorting algorithm? Some are pretty simple to implement, eg. bitonic sort.
I don’t have a Julia GPU implementation (maybe @sdanisch?), but it should be easy enough to port eg. https://github.com/maleadt/tracetransform/blob/public/cuda/src/kernels/sort.cu (written when I knew much less about GPUs so be careful :slight_smile: )

It is available now. The setting is per-category and was previously disabled.


Sorry to necro-bump this thread, but did anyone get anywhere with a GPU-friendly sorting algorithm?

AFAIK there’s no Julia implementation yet. But there’s plenty of material on GPU sorting, and bitonic sort is pretty easy to implement with loads of sample code in CUDA.
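
For what it’s worth, here is a rough, untested sketch of the classic global-memory bitonic network written against the same old @cuda (blocks, threads) launch syntax used above. It assumes a power-of-two length, and the names bitonic_step / bitonic_sort! are just illustrative:

using CUDAdrv, CUDAnative

# One compare-and-exchange pass of the bitonic network. The host launches
# this kernel once per (k, j) stage; indices are computed 0-based and
# shifted by one for Julia's 1-based arrays.
function bitonic_step(x, k, j)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x - 1  # 0-based
    l = xor(i, j)
    if l > i && l < length(x)
        ascending = (i & k) == 0
        a = x[i + 1]
        b = x[l + 1]
        if (ascending && a > b) || (!ascending && a < b)
            x[i + 1] = b
            x[l + 1] = a
        end
    end
    return
end

# Host-side driver: loops over the stages of the network and launches
# one kernel per (k, j) stage. Assumes length(x) is a power of two.
function bitonic_sort!(x)
    n = length(x)
    threads = min(n, 1024)
    blocks = cld(n, threads)
    k = 2
    while k <= n
        j = k >> 1
        while j >= 1
            @cuda (blocks, threads) bitonic_step(x, k, j)
            j >>= 1
        end
        k <<= 1
    end
    return x
end

Each stage gets its own kernel launch, so the implicit ordering of launches on the same stream provides the synchronization between stages; a shared-memory version would cut down on global memory traffic for the small-stride stages, and a padded or out-of-place variant would be needed for non-power-of-two lengths.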

Thanks, I was just reading through your code in tracetransform. My C is pretty rusty but I might have to give it a go.

Any pointers on rewriting that as a CUDAnative kernel? I’ve only been using CuArrays up until now.

Have a look at eg. the pairwise example; it should demo the features you need to implement bitonic sort (basic indexing, shared memory, and a wrapper function that computes launch bounds and executes @cuda).

@sdanisch would probably advocate that you look at the vendor-agnostic abstractions from GPUArrays.jl (that is, equivalents to the CUDA kernel APIs, but supporting multiple platforms). It might be easier to debug, having a CPU back-end, but porting simple kernels from CUDA is probably easier with CUDAnative. That’s up to you.


It’s probably better for me if it runs on OpenCL as well, and I’m already using GPUArrays a lot, so I’ll probably try that approach first.