Synchronizing CUDA kernels

Which version of CUDAdrv/Julia are you using? You should never see a raw CuError(701, nothing); you should be getting ERROR_LAUNCH_OUT_OF_RESOURCES instead.

So yeah, you’re exhausting the resources of the GPU: either registers, shared memory, or simply the maximum number of threads or blocks you’re allowed to launch in each dimension. You can use the APIs to query these limits.

Properties of the device:

julia> using CUDAdrv, CUDAnative

julia> dev = device()
CuDevice(0): GeForce GTX 970

julia> attribute(dev, CUDAdrv.MAX_BLOCK_DIM_X)
1024
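Other launch-related limits can be queried the same way. A sketch (the attribute names below mirror the driver's CU_DEVICE_ATTRIBUTE_* constants; check your CUDAdrv version if one of them is missing):

```julia
using CUDAdrv

dev = device()

# maximum threads in a single block, across all dimensions combined
attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)

# maximum number of blocks in the x dimension of the grid
attribute(dev, CUDAdrv.MAX_GRID_DIM_X)

# per-block shared memory (bytes) and register limits
attribute(dev, CUDAdrv.MAX_SHARED_MEMORY_PER_BLOCK)
attribute(dev, CUDAdrv.MAX_REGISTERS_PER_BLOCK)
```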

Properties of a compiled kernel:

julia> function vadd(a, b, c)
           i = (blockIdx().x-1) * blockDim().x + threadIdx().x
           c[i] = a[i] + b[i]
           return
       end
vadd (generic function with 1 method)

julia> kernel = cufunction(vadd, NTuple{3,CuDeviceArray{Float32,2,AS.Global}})
[ Info: Building the CUDAnative run-time library for your sm_52 device, this might take a while...
CUDAnative.HostKernel{vadd,Tuple{CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global}}}(CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true), CuModule(Ptr{Nothing} @0x00000000043ac7c0, CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true)), CuFunction(Ptr{Nothing} @0x0000000004464aa0, CuModule(Ptr{Nothing} @0x00000000043ac7c0, CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true))))

julia> CUDAnative.registers(kernel)
22

julia> CUDAnative.memory(kernel)
(local = 104, shared = 0, constant = 0)

julia> CUDAnative.maxthreads(kernel)
1024
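Putting the two together, you can sanity-check a launch configuration before launching. This is a sketch, reusing `dev` and `kernel` from above; `fits` is a hypothetical helper, not part of any API:

```julia
# hypothetical helper: would a 1-D launch of `threads` threads per block
# fit both this compiled kernel and this device?
function fits(kernel, dev, threads)
    threads <= CUDAnative.maxthreads(kernel) &&           # kernel's own limit
    threads <= attribute(dev, CUDAdrv.MAX_BLOCK_DIM_X)    # device's limit
end

fits(kernel, dev, 1024)  # OK for the kernel above
fits(kernel, dev, 2048)  # too many threads
```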

Or simply use the occupancy API to have CUDA pick the number of threads, and update your indexing to handle whatever it picks:

# adjust the kernel to perform a bounds check
function vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end

# generate data
...

# old hardcoded launch
len = prod(dims)
#@cuda threads=len vadd(d_a, d_b, d_c)

# instead use a callback to query the allowed & optimal number of threads
function get_config(kernel)
    fun = kernel.fun
    config = launch_configuration(fun)

    blocks = cld(len, config.threads)
    return (threads=config.threads, blocks=blocks)
end
@cuda config=get_config vadd(d_a, d_b, d_c)
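If you prefer not to use the `config` callback, you can also drive the occupancy API by hand: compile the kernel, query a launch configuration, and pass threads/blocks to `@cuda` yourself. A sketch, assuming `len`, `d_a`, `d_b`, `d_c` from the snippet above and the same element types as earlier in this post:

```julia
# compile ahead of time (same signature as shown earlier)
kernel = cufunction(vadd, NTuple{3,CuDeviceArray{Float32,2,AS.Global}})

# ask the occupancy API for a good block size
config = launch_configuration(kernel.fun)

# clamp to the problem size and derive the grid dimensions
threads = min(len, config.threads)
blocks = cld(len, threads)

@cuda threads=threads blocks=blocks vadd(d_a, d_b, d_c)
```

The bounds check in `vadd` is what makes this safe: the last block may contain threads with indices past `length(c)`, and those simply do nothing.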