According to this reply, CUDA.jl currently recommends the following approach for choosing the number of threads and blocks when launching a kernel:
using CUDA

function kernel(a, b)
    # Global 1-based index of this thread across the whole grid
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Total number of threads in the grid (grid-stride loop)
    stride = blockDim().x * gridDim().x
    N = length(a)
    for k = id:stride:N
        a[k] = b[k]
    end
    return nothing
end

N = 1024
a = CUDA.zeros(N)
b = CUDA.rand(N)

# Compile the kernel without launching it, then query a suggested configuration
ckernel = @cuda launch=false kernel(a, b)
config = launch_configuration(ckernel.fun)
threads = min(N, config.threads)
blocks = cld(N, threads)

ckernel(a, b; threads=threads, blocks=blocks)
@assert a == b
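To make the launch-size arithmetic concrete, here is a minimal CPU-only sketch of the two lines that compute threads and blocks. The helper name launch_dims and the value 256 used for max_threads are assumptions for illustration only; the real config.threads depends on the kernel and the device.

```julia
# Hypothetical helper (not part of CUDA.jl): reproduces the launch-size
# arithmetic from the snippet above on the CPU, so it runs without a GPU.
# `max_threads` stands in for `config.threads`; 256 below is an assumed value.
# Assumes n >= 1 (n == 0 would make `cld` divide by zero).
function launch_dims(n::Integer, max_threads::Integer)
    threads = min(n, max_threads)   # never request more threads than elements
    blocks = cld(n, threads)        # ceiling division: enough blocks to cover all n elements
    return (threads = threads, blocks = blocks)
end

launch_dims(1024, 256)   # (threads = 256, blocks = 4)
launch_dims(1000, 256)   # (threads = 256, blocks = 4)
launch_dims(100, 256)    # (threads = 100, blocks = 1)
```

In the second case 4 * 256 = 1024 threads are launched for 1000 elements; the 24 surplus threads have id > N, so their id:stride:N range is empty and they do no work.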
This code looks a bit bulky to me.
Would it be more convenient to wrap it in a macro? Something like:
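macro krun(ex...)
    len = ex[1]                  # number of elements to process
    call = ex[2]                 # the kernel call expression, e.g. kernel(a, b)
    args = call.args[2:end]      # the kernel arguments without the function name

    # Unique variable names so the expansion does not clash with caller variables
    @gensym kernel config threads blocks
    code = quote
        local $kernel = @cuda launch=false $call
        local $config = launch_configuration($kernel.fun)
        local $threads = min($len, $config.threads)
        local $blocks = cld($len, $threads)
        $kernel($(args...); threads=$threads, blocks=$blocks)
    end

    return esc(code)
end

@krun N kernel(a, b)
@assert a == b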
A kernel could then be launched with a single extra argument, the number of parallel work items required: @krun N kernel(a, b).