The most general way to estimate the optimal arguments for the @cuda macro

According to this reply, the current CUDA.jl API suggests the following approach for choosing the number of threads and blocks needed to launch a kernel:

using CUDA


function kernel(a, b)
    # grid-stride loop: each thread starts at its global index
    # and processes every `stride`-th element
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    N = length(a)
    for k = id:stride:N
        a[k] = b[k]
    end
    return nothing
end


N = 1024
a = CUDA.zeros(N)
b = CUDA.rand(N)

ckernel = @cuda launch=false kernel(a, b)   # compile the kernel without launching it
config = launch_configuration(ckernel.fun)  # occupancy-based launch suggestion
threads = min(N, config.threads)
blocks = cld(N, threads)
ckernel(a, b; threads=threads, blocks=blocks)

@assert a == b

This code looks a bit bulky to me.
Would it be more convenient to wrap it in a macro? Something like:

macro krun(len, call)
    # extract the kernel's arguments from the call expression
    args = call.args[2:end]

    # gensym'd locals cannot clash with user variables after esc
    @gensym kernel config threads blocks n
    code = quote
        local $n = $len  # evaluate `len` only once
        local $kernel = @cuda launch=false $call
        local $config = launch_configuration($kernel.fun)
        local $threads = min($n, $config.threads)
        local $blocks = cld($n, $threads)
        $kernel($(args...); threads=$threads, blocks=$blocks)
    end

    return esc(code)
end


@krun N kernel(a, b)

@assert a == b

Then kernels could be launched with a single extra argument specifying the required number of parallel threads: @krun N kernel(a, b).
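Since the macro forwards an arbitrary call expression, it should work unchanged for kernels with a different number of arguments. As a sketch (the saxpy! kernel below is an illustrative example, not part of the original post, and assumes the @krun macro defined above):

```julia
using CUDA

# Illustrative kernel: y .= a .* x .+ y, written as a grid-stride loop.
function saxpy!(y, a, x)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for k = id:stride:length(y)
        @inbounds y[k] = a * x[k] + y[k]
    end
    return nothing
end

N = 10^6
x = CUDA.rand(N)
y = CUDA.rand(N)

# One extra argument picks the launch configuration automatically.
@krun N saxpy!(y, 2f0, x)
```

Note that the first argument can be any expression (e.g. @krun length(y) saxpy!(y, 2f0, x)), since the macro binds it to a local before using it.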