Understanding GPU Kernels

I’m trying to understand the basics of GPU kernels for Julia. I’ve followed http://mikeinnes.github.io/2017/08/24/cudanative.html and the basic example for addition makes sense:

using CuArrays, CUDAnative

n = 1024
xs, ys, zs = CuArray(rand(n)), CuArray(rand(n)), CuArray(zeros(n))

function kernel_vadd(out, a, b)
  i = (blockIdx().x-1) * blockDim().x + threadIdx().x
  out[i] = a[i] + b[i]
  return
end

@cuda (1, n) kernel_vadd(zs, xs, ys)

On my graphics card, I’m limited to 1,024 threads and 4 GB of video memory. My question is: if I package up some GPU function for others to use, how can I “automatically” determine how many blocks/threads to allocate on their graphics card? Additionally, what is the best practice for “wrapping” this @cuda(blocks, threads) function... code?

Insofar as your algorithm allows for arbitrary launch parameters, you can query device limits using CUDAdrv. For example, see the CUDAnative pairwise example:

total_threads = min(n, attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK))

However, the best launch configuration depends on more than only the max threads and available memory. Often you want to maximize occupancy, which also depends on the kernel register pressure, cache behavior and/or shared memory usage.
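Leaving occupancy tuning aside, the queried limit can be turned into a simple 1-D launch configuration. The following is only a minimal sketch: `launch_config` is a hypothetical helper (not part of CUDAnative or CUDAdrv), and the literal `1024` stands in for the value returned by `attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)` so that the snippet runs without a GPU.

```julia
# Hypothetical helper: pick (blocks, threads) so that blocks * threads >= n,
# without exceeding the per-block thread limit of the device.
function launch_config(n::Integer, max_threads::Integer)
    n > 0 || throw(ArgumentError("n must be positive"))
    threads = min(n, max_threads)  # never request more threads than elements
    blocks = cld(n, threads)       # ceiling division so every element is covered
    return blocks, threads
end

# 1024 is an assumed stand-in for the queried MAX_THREADS_PER_BLOCK value.
blocks, threads = launch_config(5000, 1024)
println((blocks, threads))
```

Because `blocks * threads` can exceed `n` (here 5 * 1024 = 5120 for 5000 elements), a kernel launched with this configuration would need a bounds check such as `if i <= length(out)` before indexing.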

Just wrap it in a function? See the CUDAnative reduce example, although for real-life usage this function should probably also accept a stream parameter.


Or you can use GPUArrays.gpu_call, which has the added benefit that it can also run with CLArrays.


# Could also be CLArrays + CLArray
using CuArrays, GPUArrays
n = 1024
xs, ys, zs = CuArray(rand(n)), CuArray(rand(n)), CuArray(zeros(n))
dispatch_dummy = zs # for stream + dispatch to correct backend
args = (zs, xs, ys)
gpu_call(dispatch_dummy, args) do gpu_state, out, a, b
    i = linear_index(gpu_state)
    if i <= length(out)
        out[i] = a[i] + b[i]
    end
    return
end

With gpu_call the launch parameters are optional and default to something reasonable (see https://juliagpu.github.io/GPUArrays.jl/latest/#GPUArrays.gpu_call), but as @maleadt indicated, you might need manual tuning.


Thanks @maleadt and @sdanisch! This was just what I was looking for.

Just to note, there’s nothing magical about the CuArrays implementation, it’s all just the same constructs – so it might be worth poking through if you’re interested in putting similar things together. If you write something that’s reasonably generic we’d even be happy to take PRs for it.
