Optimizing the use of Blocks, Threads vs. Array Indexing

Assume that in Julia I have the following, which should be fine because arrays in Julia are column-major.

x = Array{Float32, 3}(500, 20, 200)
# Initialize x
for c = 1:200
    for b = 1:20
        for a = 1:500
            x[a, b, c] = x[a, b, c] + 1
        end
    end
end

Now, if I want to write this using CUDAdrv and CUDAnative, I might have

function kernel(x)
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[a, b, c] = x[a, b, c] + 1
    return  # kernels must not return a value
end

@cuda ((200, 20), 500) kernel(x)

Is the above code optimal with regard to

  1. how I represent a, b, c using blocks and threads: is this the typical way to do it?
  2. how I index the array the same way as in Julia, or whether I should reverse the indices as x[c, b, a]: would that be faster?

Am I overthinking this? How much of a performance gain/hit will my decision cause?

Is the above code optimal

Well, you can just benchmark it, right?

Given this slightly modified example, which lets me quickly shift the indices around (while staying within the 1024-thread limit of my GPU):

function kernel(x)
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[a, b, c] = x[a, b, c] + 1
    return
end

dx = CuArray{Float32,3}(32, 32, 32)
@cuda ((32, 32), 32) kernel(dx)

Now let’s profile this:

$ nvprof julia wip.jl
==15727== NVPROF is profiling process 15727, command: julia wip.jl
==15727== Profiling application: julia wip.jl
==15727== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  7.3280us         1  7.3280us  7.3280us  7.3280us  ptxcall_kernel_63191

Now swap the indexing around and benchmark again. You’ll see that only the situation where the innermost (first) index is the fastest-evolving one triggers global memory coalescing, as I described in detail in CuArray is Row Major or Column Major? - #2 by maleadt
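
For concreteness, the swapped variant might look like the following sketch (kernel_swapped is just an illustrative name; it reuses dx from above):

function kernel_swapped(x)
    # The first (fastest-varying) array index now comes from a block index,
    # so consecutive threads within a block no longer touch consecutive memory.
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[c, b, a] = x[c, b, a] + 1
    return
end

@cuda ((32, 32), 32) kernel_swapped(dx)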

How much of a performance gain/hit will my decision cause?

On my GPU (first generation Titan), there’s a 2x penalty by indexing inefficiently. Of course, in the presence of other operations, and with a higher occupancy, this penalty might be significantly lower.

On an unrelated note, please mark threads as "solved" if they’ve answered your questions.
Makes it easier to maintain overview of the GPU category :slight_smile:

Sorry to have to ask this. Where is the menu option to mark a thread as solved?

Little tickbox, below answers.

I made the array larger so I can see a bigger difference.

Here is Code A, where the innermost index is mapped to threadIdx().x.

using CUDAdrv, CUDAnative

function kernel(x)
    c = blockIdx().x
    b = blockIdx().y
    a = threadIdx().x
    x[a, b, c] = x[a, b, c] + 1
    return
end

dx = CuArray{Float32,3}(1024, 1024, 1024)
@cuda ((1024, 1024), 1024) kernel(dx)

With nvprof I get

GPU activities:  100.00%  24.763ms         1  24.763ms  24.763ms  24.763ms  ptxcall_kernel_61011

Now with Code B, where the innermost index is mapped to blockIdx().y, which I assume is the slowest-evolving one.

using CUDAdrv, CUDAnative

function kernel(x)
    c = blockIdx().y
    b = blockIdx().x
    a = threadIdx().x
    x[c, b, a] = x[c, b, a] + 1
    return
end

dx = CuArray{Float32,3}(1024, 1024, 1024)
@cuda ((1024, 1024), 1024) kernel(dx)

With nvprof I get

GPU activities:  100.00%  1.23447s         1  1.23447s  1.23447s  1.23447s  ptxcall_kernel_61011

That is a difference of around 50 times (24.763 ms vs. 1.23447 s). So I now assume the best practice is to always map the innermost index to threadIdx().x.

By innermost I mean the leftmost index, which I hope is correct.
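
As an aside, if the first dimension were larger than the 1024-thread limit, one way to keep that index on threadIdx().x is to tile it over an additional block dimension. A rough, untested sketch with illustrative names (the bounds check and the blockIdx().z tiling are my own additions, not from the posts above):

using CUDAdrv, CUDAnative

# Keep the fastest-varying (first) array index on threadIdx().x so that
# consecutive threads access consecutive memory, and tile that dimension
# over blockIdx().z when it exceeds the 1024-thread limit.
function kernel_tiled(x)
    a = (blockIdx().z - 1) * blockDim().x + threadIdx().x
    b = blockIdx().y
    c = blockIdx().x
    if a <= size(x, 1)
        x[a, b, c] = x[a, b, c] + 1
    end
    return
end

dx = CuArray{Float32,3}(4096, 32, 32)
threads = 1024
blocks = (32, 32, cld(size(dx, 1), threads))  # (c, b, tiles of a)
@cuda (blocks, threads) kernel_tiled(dx)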

I am sorry, but this is all I can see. I do not see a tick box.

I think that it only shows up if the post was tagged as a question, and there are apparently some other idiosyncratic rules; you can ask in Meta Discussion - JuliaLang.

Yes, correct. In that sense it kind of corresponds with the guidelines for array programming on the CPU, but for different reasons:

A rule of thumb to keep in mind is that with column-major arrays, the first index changes most rapidly.

From https://docs.julialang.org/en/latest/manual/performance-tips/#Access-arrays-in-memory-order,-along-columns-1

Also know that for simple time measurements you can just as well use CUDAdrv.@elapsed (keeping in mind that those measurements are less accurate: the first one includes compilation time, and all of them include the overhead of calling into the CUDA API):

# slow kernel
julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
9.108586f0

julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
2.5113032f0

julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
2.4996145f0

# fast kernel

julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
0.14859039f0

julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
0.052447107f0

julia> CUDAdrv.@elapsed @cuda ((1024, 1024), 1024) kernel(dx)
0.0524136f0

Oh sorry I had assumed this was possible for all types of posts.

Thank you Maleadt, with your method I was able to pinpoint the bottleneck in my code. It turns out that my implementation of “Quicksort” is taking up the vast majority of my time; I have to run this sort about 4,000 times. So I just implemented a typical “Quicksort without recursion” on CUDA and ran it in parallel.

I guess I will have to look for how to efficiently implement sort in CUDA.

Quicksort doesn’t sound like it would be GPU-friendly; did you try a parallel sorting algorithm? Some are pretty simple to implement, eg. bitonic sort.
I don’t have a Julia GPU implementation (maybe @sdanisch?), but it should be easy enough to port eg. https://github.com/maleadt/tracetransform/blob/public/cuda/src/kernels/sort.cu (written when I knew much less about GPUs so be careful :slight_smile: )

It is available now. The setting is per-category and was previously disabled.


Sorry to necro-bump this thread, but did anyone get anywhere with a GPU-friendly sorting algorithm?

AFAIK there’s no Julia implementation yet. But there’s plenty of material on GPU sorting, and bitonic sort is pretty easy to implement with loads of sample code in CUDA.
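
For what it’s worth, here is a rough, untested sketch of the classic global-memory bitonic network written against the same old @cuda (blocks, threads) launch syntax used above. It assumes a power-of-two length, and the names bitonic_step / bitonic_sort! are just illustrative:

using CUDAdrv, CUDAnative

# One compare-and-exchange pass of the bitonic network. The host launches
# this kernel once per (k, j) stage; indices are computed 0-based and
# shifted by one for Julia's 1-based arrays.
function bitonic_step(x, k, j)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x - 1  # 0-based
    l = xor(i, j)
    if l > i && l < length(x)
        ascending = (i & k) == 0
        a = x[i + 1]
        b = x[l + 1]
        if (ascending && a > b) || (!ascending && a < b)
            x[i + 1] = b
            x[l + 1] = a
        end
    end
    return
end

# Host-side driver: loops over the stages of the network and launches
# one kernel per (k, j) stage. Assumes length(x) is a power of two.
function bitonic_sort!(x)
    n = length(x)
    threads = min(n, 1024)
    blocks = cld(n, threads)
    k = 2
    while k <= n
        j = k >> 1
        while j >= 1
            @cuda (blocks, threads) bitonic_step(x, k, j)
            j >>= 1
        end
        k <<= 1
    end
    return x
end

Each stage gets its own kernel launch, so the implicit ordering of launches on the same stream provides the synchronization between stages; a shared-memory version would cut down on global memory traffic for the small-stride stages, and a padded or out-of-place variant would be needed for non-power-of-two lengths.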

Thanks, I was just reading through your code in tracetransform. My C is pretty rusty but I might have to give it a go.

Any pointers on rewriting that as a CUDAnative kernel? I’ve only been using CuArrays up until now.

Have a look at eg. the pairwise example; it should demo the features you need to implement bitonic sort (basic indexing, shared memory, and a wrapper function that computes launch bounds and executes @cuda).

@sdanisch would probably advocate that you look at the vendor-agnostic abstractions from GPUArrays.jl (that is, equivalents to the CUDA kernel APIs, but supporting multiple platforms). It might be easier to debug, having a CPU back-end, but porting simple kernels from CUDA is probably easier with CUDAnative. That’s up to you.


It’s probably better for me if it runs on OpenCL as well, and I’m already using GPUArrays a lot, so I’ll probably try that approach first.