Synchronizing CUDA kernels

Which version of CUDAdrv/Julia are you using? You should never see a raw CuError(701, nothing); you should be getting ERROR_LAUNCH_OUT_OF_RESOURCES instead.

So yeah, you’re exhausting the resources of the GPU: either registers, shared memory, or simply the maximum number of threads or blocks you’re allowed to launch in each dimension. You can use the APIs to query these limits.

Properties of the device:

julia> using CUDAdrv, CUDAnative

julia> dev = device()
CuDevice(0): GeForce GTX 970

julia> attribute(dev, CUDAdrv.MAX_BLOCK_DIM_X)
1024
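Other launch-related limits can be queried the same way. A sketch (the attribute names below mirror the driver's CU_DEVICE_ATTRIBUTE_* constants; check your CUDAdrv version if one of them is missing):

```julia
using CUDAdrv

dev = device()

# maximum threads in a single block, across all dimensions combined
attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)

# maximum number of blocks in the x dimension of the grid
attribute(dev, CUDAdrv.MAX_GRID_DIM_X)

# per-block shared memory (bytes) and register limits
attribute(dev, CUDAdrv.MAX_SHARED_MEMORY_PER_BLOCK)
attribute(dev, CUDAdrv.MAX_REGISTERS_PER_BLOCK)
```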

Properties of a compiled kernel:

julia> function vadd(a, b, c)
           i = (blockIdx().x-1) * blockDim().x + threadIdx().x
           c[i] = a[i] + b[i]
           return
       end
vadd (generic function with 1 method)

julia> kernel = cufunction(vadd, NTuple{3,CuDeviceArray{Float32,2,AS.Global}})
[ Info: Building the CUDAnative run-time library for your sm_52 device, this might take a while...
CUDAnative.HostKernel{vadd,Tuple{CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global}}}(CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true), CuModule(Ptr{Nothing} @0x00000000043ac7c0, CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true)), CuFunction(Ptr{Nothing} @0x0000000004464aa0, CuModule(Ptr{Nothing} @0x00000000043ac7c0, CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true))))

julia> CUDAnative.registers(kernel)
22

julia> CUDAnative.memory(kernel)
(local = 104, shared = 0, constant = 0)

julia> CUDAnative.maxthreads(kernel)
1024
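Putting the two together, you can sanity-check a launch configuration before launching. This is a sketch, reusing `dev` and `kernel` from above; `fits` is a hypothetical helper, not part of any API:

```julia
# hypothetical helper: would a 1-D launch of `threads` threads per block
# fit both this compiled kernel and this device?
function fits(kernel, dev, threads)
    threads <= CUDAnative.maxthreads(kernel) &&           # kernel's own limit
    threads <= attribute(dev, CUDAdrv.MAX_BLOCK_DIM_X)    # device's limit
end

fits(kernel, dev, 1024)  # OK for the kernel above
fits(kernel, dev, 2048)  # too many threads
```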

Or simply use the occupancy API to have CUDA pick the number of threads, and update your indexing to handle whatever it picks:

# adjust the kernel to perform a bounds check
function vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end

# generate data
...

# old hardcoded launch
len = prod(dims)
#@cuda threads=len vadd(d_a, d_b, d_c)

# instead use a callback to query the allowed & optimal number of threads
function get_config(kernel)
    fun = kernel.fun
    config = launch_configuration(fun)

    blocks = cld(len, config.threads)
    return (threads=config.threads, blocks=blocks)
end
@cuda config=get_config vadd(d_a, d_b, d_c)
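If you prefer not to use the `config` callback, you can also drive the occupancy API by hand: compile the kernel, query a launch configuration, and pass threads/blocks to `@cuda` yourself. A sketch, assuming `len`, `d_a`, `d_b`, `d_c` from the snippet above and the same element types as earlier in this post:

```julia
# compile ahead of time (same signature as shown earlier)
kernel = cufunction(vadd, NTuple{3,CuDeviceArray{Float32,2,AS.Global}})

# ask the occupancy API for a good block size
config = launch_configuration(kernel.fun)

# clamp to the problem size and derive the grid dimensions
threads = min(len, config.threads)
blocks = cld(len, threads)

@cuda threads=threads blocks=blocks vadd(d_a, d_b, d_c)
```

The bounds check in `vadd` is what makes this safe: the last block may contain threads with indices past `length(c)`, and those simply do nothing.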