Which version of CUDAdrv/Julia are you using? You should never see CuError(701, nothing), but ERROR_LAUNCH_OUT_OF_RESOURCES instead.
So yeah, you’re exhausting the GPU’s resources: either registers or shared memory, or you’re plainly exceeding the maximum number of threads or blocks you’re allowed to launch in each dimension. You can use the APIs to query these limits.
Properties of the device:
julia> using CUDAdrv, CUDAnative
julia> dev = device()
CuDevice(0): GeForce GTX 970
julia> attribute(dev, CUDAdrv.MAX_BLOCK_DIM_X)
1024
Properties of a compiled kernel:
julia> function vadd(a, b, c)
           i = (blockIdx().x-1) * blockDim().x + threadIdx().x
           c[i] = a[i] + b[i]
           return
       end
vadd (generic function with 1 method)
julia> kernel = cufunction(vadd, NTuple{3,CuDeviceArray{Float32,2,AS.Global}})
[ Info: Building the CUDAnative run-time library for your sm_52 device, this might take a while...
CUDAnative.HostKernel{vadd,Tuple{CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global}}}(CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true), CuModule(Ptr{Nothing} @0x00000000043ac7c0, CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true)), CuFunction(Ptr{Nothing} @0x0000000004464aa0, CuModule(Ptr{Nothing} @0x00000000043ac7c0, CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true))))
julia> CUDAnative.registers(kernel)
22
julia> CUDAnative.memory(kernel)
(local = 104, shared = 0, constant = 0)
julia> CUDAnative.maxthreads(kernel)
1024
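If you want to sanity-check a launch by hand, the numbers queried above can be compared against the requested configuration. The following is only a sketch: check_launch is a hypothetical helper (not part of CUDAdrv or CUDAnative), it takes plain integers so it runs without a GPU, and the grid limit used in the example is an assumed placeholder that you would query from your own device.

```julia
# Hypothetical helper, not a CUDAdrv/CUDAnative API: compares a requested 1D launch
# against limits you have queried yourself. Returns a list of human-readable problems.
function check_launch(threads, blocks; max_threads, max_block_dim_x, max_grid_dim_x)
    problems = String[]
    # per-kernel limit, e.g. the value returned by CUDAnative.maxthreads(kernel)
    threads > max_threads &&
        push!(problems, "threads=$threads exceeds the kernel limit of $max_threads")
    # per-device limit, e.g. attribute(dev, CUDAdrv.MAX_BLOCK_DIM_X)
    threads > max_block_dim_x &&
        push!(problems, "threads=$threads exceeds the block dimension limit of $max_block_dim_x")
    # per-device grid limit (assumed to be queried the same way)
    blocks > max_grid_dim_x &&
        push!(problems, "blocks=$blocks exceeds the grid dimension limit of $max_grid_dim_x")
    return problems
end

# 1024 and 1024 are the values shown above; the grid limit is an assumed placeholder.
limits = (max_threads=1024, max_block_dim_x=1024, max_grid_dim_x=2147483647)

too_many = check_launch(4096, 1; limits...)   # one block with 4096 threads
fine     = check_launch(1024, 4; limits...)   # same work split over 4 blocks
```

Here too_many lists two problems (the kernel limit and the block dimension limit are both exceeded), while fine is empty. Keep in mind that the per-kernel limit can be lower than the device limit for kernels that use many registers, which is why both are checked.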
Or simply use the occupancy API to have CUDA pick a number of threads, and update your indexing to be able to handle that:
# adjust the kernel to perform a bounds check
function vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end
# generate data
...
# old hardcoded launch
len = prod(dims)
#@cuda threads=len vadd(d_a, d_b, d_c)
# instead use a callback to query the allowed & optimal number of threads
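function get_config(kernel)
    fun = kernel.fun
    # ask the occupancy API for a thread count suited to this compiled kernel
    config = launch_configuration(fun)
    # enough blocks to cover all `len` elements (`len` is the global defined above)
    blocks = cld(len, config.threads)
    return (threads=config.threads, blocks=blocks)
end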
@cuda config=get_config vadd(d_a, d_b, d_c)
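To see why the bounds check is needed once the block count is rounded up, here is a small CPU-only sketch of the same index arithmetic. The values of len and threads are made up for illustration; on a real device threads would be whatever launch_configuration returns.

```julia
# Pure-Julia emulation of the launch arithmetic; nothing here touches the GPU.
len = 2500          # illustrative problem size (assumption)
threads = 1024      # stand-in for config.threads (assumption)
blocks = cld(len, threads)   # ceiling division, so every element gets a thread

# The 1-based global index each (block, thread) pair computes in the kernel:
# (blockIdx().x-1) * blockDim().x + threadIdx().x
indices = [(b - 1) * threads + t for b in 1:blocks for t in 1:threads]

covered  = count(i -> i <= len, indices)   # threads that pass `i <= length(c)`
overhang = count(i -> i >  len, indices)   # threads that must skip the write
```

Because blocks * threads is generally larger than len, the last block contains threads whose index falls past the end of the arrays; without the check in vadd they would access memory out of bounds.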