The most general way to estimate the optimal arguments for the @cuda macro

How can one choose the optimal arguments (threads, blocks, etc.) to launch CUDA kernels with the @cuda macro? In the following example I consider a simple kernel which performs element-wise addition of two matrices:

using CuArrays
using CUDAdrv
using CUDAnative

const MAX_THREADS_PER_BLOCK = CUDAdrv.attribute(
   CUDAnative.CuDevice(0), CUDAdrv.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
)

function kernel(a, b, c)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    NN = length(a)
    # grid-stride loop: each thread handles elements id, id+stride, id+2*stride, ...
    for i=id:stride:NN
        a[i] = b[i] + c[i]
    end
    return nothing
end


function foo1(a, b, c)
    @cuda kernel(a, b, c)
    return nothing
end


function foo2(a, b, c)
    NN = length(a)
    nth = min(NN, MAX_THREADS_PER_BLOCK)
    nbl = cld(NN, nth)
    @cuda blocks=nbl threads=nth kernel(a, b, c)
    return nothing
end
    
    
function foo3(a, b, c)
    function get_config(kernel)
        fun = kernel.fun
        config = launch_configuration(fun)
        blocks = cld(NN, config.threads)
        return (threads=config.threads, blocks=blocks)
    end
    NN = length(a)
    @cuda config=get_config kernel(a, b, c)
    return nothing
end
    
    
function foo4(a, b, c)
    function get_config(kernel)
        fun = kernel.fun
        config = launch_configuration(fun)
        return (threads=config.threads, blocks=config.blocks)
    end
    @cuda config=get_config kernel(a, b, c)
    return nothing
end


function main()
    N = 2000
    a = CuArrays.rand(Float32, (N, N))
    b = CuArrays.rand(Float32, (N, N))
    c = CuArrays.rand(Float32, (N, N))

    NN = length(a)
    
    cukernel = cufunction(kernel, NTuple{3,CuDeviceArray{Float32,2,AS.Global}})
    config = launch_configuration(cukernel.fun)
    
    foo2_threads = min(NN, MAX_THREADS_PER_BLOCK)
    foo2_blocks = cld(NN, foo2_threads)

    foo3_threads = config.threads
    foo3_blocks = cld(NN, config.threads)
    
    foo4_threads = config.threads
    foo4_blocks = config.blocks
    
    println()
    println("foo1 (plain @cuda):")
    CuArrays.@time foo1(a, b, c)
    CuArrays.@time foo1(a, b, c)
    CuArrays.@time foo1(a, b, c)
    
    println()
    println("foo2 (threads=$foo2_threads blocks=$foo2_blocks):")
    CuArrays.@time foo2(a, b, c)
    CuArrays.@time foo2(a, b, c)
    CuArrays.@time foo2(a, b, c)
    
    println()
    println("foo3 (threads=$foo3_threads blocks=$foo3_blocks):")
    CuArrays.@time foo3(a, b, c)
    CuArrays.@time foo3(a, b, c)
    CuArrays.@time foo3(a, b, c)
    
    println()
    println("foo4 (threads=$foo4_threads blocks=$foo4_blocks):")
    CuArrays.@time foo4(a, b, c)
    CuArrays.@time foo4(a, b, c)
    CuArrays.@time foo4(a, b, c)
    
    return nothing
end

main()

On my machine the result is:

foo1 (plain @cuda):
  1.372698 seconds (32.08 k CPU allocations: 1.639 MiB)
  1.221025 seconds (42 CPU allocations: 1.219 KiB)
  1.214232 seconds (42 CPU allocations: 1.219 KiB)

foo2 (threads=1024 blocks=3907):
  0.002800 seconds (52 CPU allocations: 1.375 KiB)
  0.002400 seconds (52 CPU allocations: 1.375 KiB)
  0.002305 seconds (52 CPU allocations: 1.375 KiB)

foo3 (threads=640 blocks=6250):
  0.097974 seconds (87.40 k CPU allocations: 4.605 MiB)
  0.003157 seconds (96 CPU allocations: 3.281 KiB)
  0.002521 seconds (96 CPU allocations: 3.281 KiB)

foo4 (threads=640 blocks=32):
  0.015719 seconds (12.93 k CPU allocations: 666.858 KiB)
  0.000456 seconds (47 CPU allocations: 1.297 KiB)
  0.000401 seconds (47 CPU allocations: 1.297 KiB)

As I understand it, there are a number of approaches to choosing the number of blocks and threads for the @cuda macro:

  1. Like in function foo1, use the defaults of the @cuda macro.
  2. Like in function foo2, use the DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK attribute of the device. This approach does not work in all cases, since there are situations where I hit the "too many resources requested" error due to heavy register usage.
  3. Like in function foo3, where the number of threads is obtained from the launch_configuration function and the number of blocks is then calculated from the length of the input arrays. This approach is suggested here: Synchronizing Cuda kernels - #3 by maleadt and used in, e.g., CuArrays (CuArrays.jl/indexing.jl at c58cabc3d82c2f882bd0dc2d29457e2f247a5ed9 · JuliaGPU/CuArrays.jl · GitHub).
  4. Like in function foo4, where both the number of threads and the number of blocks come from launch_configuration.

The timings above suggest that the fastest run was achieved with approach 4. However, since other packages use approach 3, I guess it is better or safer in general.
To sum up, my questions are:

  • What is the most general approach to choosing the launch arguments for the @cuda macro?
  • Apart from the threads and blocks arguments, what are the other important arguments that can significantly affect performance? Shared memory?
  • Why doesn't the @cuda macro by default estimate the number of threads and blocks needed to maximize performance?

Option 3. The number of blocks returned by the occupancy API is the minimum number of blocks needed to saturate the GPU. If you don’t have that much work, though, it’s useless to spawn more blocks. However, you can use the information to restructure your algorithm, e.g., like how CuArrays’ mapreduce has a loop that processes multiple items per thread, but doesn’t if that would prevent saturating the GPU.
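As an illustration of combining both numbers (not code from any package, and written against the newer CUDA.jl API that appears later in this thread), reusing the grid-stride kernel from the question:

using CUDA

function addkernel(a, b, c)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i = id:stride:length(a)
        a[i] = b[i] + c[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 2000, 2000)
b = CUDA.rand(Float32, 2000, 2000)
c = CUDA.rand(Float32, 2000, 2000)
NN = length(a)

k = @cuda launch=false addkernel(a, b, c)
config = launch_configuration(k.fun)

# Occupancy-based thread count, but never more blocks than there is work:
# for a small array, cld(NN, threads) < config.blocks, so no useless blocks
# are spawned; for a large array, blocks is capped at config.blocks and the
# grid-stride loop lets each thread handle several elements.
threads = min(NN, config.threads)
blocks = min(config.blocks, cld(NN, threads))

k(a, b, c; threads=threads, blocks=blocks)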

The @cuda macro cannot and does not estimate anything because it does not assume that your inputs are arrays. If you don’t provide a launch configuration, you will only get a single thread on a single block.
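Concretely, the plain foo1 call above is equivalent to spelling out those defaults:

# One block of one thread, so the grid-stride loop walks all 4,000,000
# elements serially; hence the ~1.2 s timings for foo1 above.
@cuda threads=1 blocks=1 kernel(a, b, c)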


Thank you for the fast answer.
Should I care about the shmem argument if I know nothing in advance about the input arrays of my kernel?

No, the shmem argument is for the case where you use dynamic shared memory, which you would also pass to @cuda via the shmem argument. Admittedly, this is not really documented, but it pretty directly mirrors the CUDA C occupancy API we’re wrapping here.
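For completeness, a minimal sketch of what that looks like (a toy kernel that just stages data through dynamic shared memory, using the CuDynamicSharedArray helper from recent CUDA.jl versions; older CUDAnative versions used @cuDynamicSharedMem instead):

using CUDA

function staged_copy_kernel(a)
    # Dynamic shared memory: the size is not known at compile time,
    # it comes from the shmem launch argument.
    tmp = CuDynamicSharedArray(Float32, blockDim().x)
    i = threadIdx().x
    tmp[i] = a[i]
    sync_threads()
    a[i] = tmp[i]
    return nothing
end

a = CUDA.rand(Float32, 256)
shmem_bytes = length(a) * sizeof(Float32)

k = @cuda launch=false staged_copy_kernel(a)
# The same shmem size goes to the occupancy query (shared-memory usage
# limits how many blocks fit on a multiprocessor) and to the launch.
config = launch_configuration(k.fun; shmem=shmem_bytes)
threads = min(length(a), config.threads)
k(a; threads=threads, blocks=1, shmem=shmem_bytes)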

Thank you.

According to this reply, the current CUDA.jl API proposes the following approach to choosing the number of threads and blocks needed to launch kernels:

using CUDA


function kernel(a, b)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    N = length(a)
    for k=id:stride:N
        a[k] = b[k]
    end
    return nothing
end


N = 1024
a = CUDA.zeros(N)
b = CUDA.rand(N)

ckernel = @cuda launch=false kernel(a, b)
config = launch_configuration(ckernel.fun)
threads = min(N, config.threads)
blocks =  cld(N, threads)
ckernel(a, b; threads=threads, blocks=blocks)

@assert a == b

The corresponding code looks a bit bulky to me.
Would it be more convenient to wrap it in a macro? Something like:

macro krun(ex...)
    len = ex[1]   # number of required parallel work items
    call = ex[2]  # kernel call expression, e.g. kernel(a, b)

    args = call.args[2:end]

    @gensym kernel config threads blocks
    code = quote
        local $kernel = @cuda launch=false $call
        local $config = launch_configuration($kernel.fun)
        local $threads = min($len, $config.threads)
        local $blocks = cld($len, $threads)
        $kernel($(args...); threads=$threads, blocks=$blocks)
    end

    return esc(code)
end


@krun N kernel(a, b)

@assert a == b

Then it would be possible to launch kernels with a single extra argument, which corresponds to the number of required parallel work items: @krun N kernel(a, b).

CUDA.jl is intended to be the low-level interface to the CUDA APIs, and the logic you put in that macro depends heavily on how the kernel you wrote works, so it’s not possible to generalize this at the kernel level. You could of course use broadcast, which sits at a higher level, so it can reason about all this itself. If you insist on writing a kernel-like function, you might be interested in KernelAbstractions.jl.
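For comparison, the broadcast route needs no launch configuration at all; for the addition kernel from the top of the thread it is just:

using CUDA

a = CUDA.zeros(Float32, 2000, 2000)
b = CUDA.rand(Float32, 2000, 2000)
c = CUDA.rand(Float32, 2000, 2000)

# Fused element-wise GPU kernel generated by broadcasting;
# the launch configuration is chosen internally.
a .= b .+ c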