Understanding GPU Kernels

jacobcvt12 · April 10, 2018, 12:30am

I’m trying to understand the basics of GPU kernels for Julia. I’ve followed http://mikeinnes.github.io/2017/08/24/cudanative.html, and the basic example for addition makes sense:

using CuArrays, CUDAnative

n = 1024
xs, ys, zs = CuArray(rand(n)), CuArray(rand(n)), CuArray(zeros(n))

function kernel_vadd(out, a, b)
  i = (blockIdx().x-1) * blockDim().x + threadIdx().x
  out[i] = a[i] + b[i]
  return
end

@cuda (1, n) kernel_vadd(zs, xs, ys)

On my graphics card, I’m limited to 1,024 threads and 4 GB of video memory. My question is: if I package up some GPU function for others to use, how can I “automatically” determine how many blocks/threads to allocate on their graphics card? Additionally, what is the best practice for “wrapping” this @cuda(blocks, threads) function... code?

maleadt · April 10, 2018, 6:54am

Insofar as your algorithm allows for arbitrary launch parameters, you can query device limits using CUDAdrv. For example, see the CUDAnative pairwise example:

total_threads = min(n, attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK))

However, the best launch configuration depends on more than only the max threads and available memory. Often you want to maximize occupancy, which also depends on the kernel register pressure, cache behavior and/or shared memory usage.
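To make the limit-based calculation concrete, here is a minimal sketch in plain Julia (no GPU required) of turning a problem size and a per-block thread limit into a 1-D launch configuration. The helper name `launch_config` is hypothetical and not part of CUDAnative or CUDAdrv; it only covers the "max threads" constraint and does nothing about occupancy, registers, or shared memory.

```julia
# Hypothetical helper (not part of CUDAnative or CUDAdrv): choose a 1-D launch
# configuration that covers n elements under a per-block thread limit.
function launch_config(n::Integer, max_threads::Integer)
    n > 0 || throw(ArgumentError("n must be positive"))
    max_threads > 0 || throw(ArgumentError("max_threads must be positive"))
    threads = min(n, max_threads)   # never request more threads than elements
    blocks = cld(n, threads)        # ceiling division, so every element is covered
    return blocks, threads
end

launch_config(1024, 1024)   # (1, 1024)
launch_config(3000, 1024)   # (3, 1024)
```

Note that in the second case 3 × 1024 = 3072 threads are launched for 3000 elements, so a kernel launched this way needs a bounds check such as `if i <= length(out)` before indexing.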

Just wrap it in a function? See the CUDAnative reduce example, although for real-life usage this function should probably also accept a stream parameter.
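As a rough illustration of "wrap it in a function", the following sketch combines the device query with the vector-addition kernel from the first post. It assumes the `@cuda (blocks, threads)` syntax and the CUDAdrv `attribute` call as they appear in this thread; the wrapper name `vadd!`, the `dev` keyword, and the default of device 0 are assumptions made for illustration, and no stream parameter is handled.

```julia
using CuArrays, CUDAnative, CUDAdrv

# Same kernel as in the first post, plus a bounds check because the grid
# may contain more threads than there are elements.
function kernel_vadd(out, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        out[i] = a[i] + b[i]
    end
    return
end

# Hypothetical wrapper: picks the launch configuration from the device limit.
# Assumes non-empty inputs of equal length; device 0 is an arbitrary default.
function vadd!(out, a, b; dev = CuDevice(0))
    n = length(out)
    threads = min(n, attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK))
    blocks = cld(n, threads)   # ceiling division covers all n elements
    @cuda (blocks, threads) kernel_vadd(out, a, b)
    return out
end

n = 3000
xs, ys, zs = CuArray(rand(n)), CuArray(rand(n)), CuArray(zeros(n))
vadd!(zs, xs, ys)
```

Callers then never see blocks or threads at all, which also leaves room to swap in an occupancy-based heuristic later without changing the interface.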

sdanisch · April 10, 2018, 9:17am

Or you can use GPUArrays.gpu_call, which has the added benefit that it can also run with CLArrays.


# Could also be CLArrays + CLArray
using CuArrays, GPUArrays
n = 1024
xs, ys, zs = CuArray(rand(n)), CuArray(rand(n)), CuArray(zeros(n))
dispatch_dummy = zs # for stream + dispatch to correct backend
args = (zs, xs, ys)
gpu_call(dispatch_dummy, args) do gpu_state, out, a, b
    i = linear_index(gpu_state)
    if i <= length(out)
        out[i] = a[i] + b[i]
    end
    return
end

With gpu_call the launch parameters are optional and default to something reasonable (see https://juliagpu.github.io/GPUArrays.jl/latest/#GPUArrays.gpu_call), but as @maleadt indicated, you might need manual tuning.

jacobcvt12 · April 10, 2018, 1:50pm

Thanks @maleadt and @sdanisch! This was just what I was looking for.

MikeInnes · April 10, 2018, 3:05pm

Just to note, there’s nothing magical about the CuArrays implementation, it’s all just the same constructs – so it might be worth poking through if you’re interested in putting similar things together. If you write something that’s reasonably generic we’d even be happy to take PRs for it.
