How can one choose optimal launch arguments (threads, blocks, etc.) for CUDA kernels launched with the @cuda macro? In the following example I consider a simple kernel that performs element-wise addition of two matrices:
using CuArrays
using CUDAdrv
using CUDAnative

# Maximum number of threads per block supported by device 0.
const MAX_THREADS_PER_BLOCK = CUDAdrv.attribute(
    CUDAnative.CuDevice(0), CUDAdrv.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
)
function kernel(a, b, c)
    # Grid-stride loop: each thread processes elements id, id+stride, id+2*stride, ...
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    NN = length(a)
    for i = id:stride:NN
        a[i] = b[i] + c[i]
    end
    return nothing
end
function foo1(a, b, c)
    # Launch with the @cuda defaults (a single block of a single thread).
    @cuda kernel(a, b, c)
    return nothing
end
function foo2(a, b, c)
    # Use the device attribute: as many threads per block as allowed,
    # and enough blocks to cover the whole array.
    NN = length(a)
    nth = min(NN, MAX_THREADS_PER_BLOCK)
    nbl = cld(NN, nth)
    @cuda blocks=nbl threads=nth kernel(a, b, c)
    return nothing
end
function foo3(a, b, c)
    # Take the number of threads from the occupancy API and derive the
    # number of blocks from the array length (NN is captured by the closure).
    function get_config(kernel)
        fun = kernel.fun
        config = launch_configuration(fun)
        blocks = cld(NN, config.threads)
        return (threads=config.threads, blocks=blocks)
    end
    NN = length(a)
    @cuda config=get_config kernel(a, b, c)
    return nothing
end
function foo4(a, b, c)
    # Take both the number of threads and the number of blocks
    # from the occupancy API.
    function get_config(kernel)
        fun = kernel.fun
        config = launch_configuration(fun)
        return (threads=config.threads, blocks=config.blocks)
    end
    @cuda config=get_config kernel(a, b, c)
    return nothing
end
function main()
    N = 2000
    a = CuArrays.rand(Float32, (N, N))
    b = CuArrays.rand(Float32, (N, N))
    c = CuArrays.rand(Float32, (N, N))
    NN = length(a)
    cukernel = cufunction(kernel, NTuple{3,CuDeviceArray{Float32,2,AS.Global}})
    config = launch_configuration(cukernel.fun)
    foo2_threads = min(NN, MAX_THREADS_PER_BLOCK)
    foo2_blocks = cld(NN, foo2_threads)
    foo3_threads = config.threads
    foo3_blocks = cld(NN, config.threads)
    foo4_threads = config.threads
    foo4_blocks = config.blocks
    println()
    println("foo1 (plain @cuda):")
    CuArrays.@time foo1(a, b, c)
    CuArrays.@time foo1(a, b, c)
    CuArrays.@time foo1(a, b, c)
    println()
    println("foo2 (threads=$foo2_threads blocks=$foo2_blocks):")
    CuArrays.@time foo2(a, b, c)
    CuArrays.@time foo2(a, b, c)
    CuArrays.@time foo2(a, b, c)
    println()
    println("foo3 (threads=$foo3_threads blocks=$foo3_blocks):")
    CuArrays.@time foo3(a, b, c)
    CuArrays.@time foo3(a, b, c)
    CuArrays.@time foo3(a, b, c)
    println()
    println("foo4 (threads=$foo4_threads blocks=$foo4_blocks):")
    CuArrays.@time foo4(a, b, c)
    CuArrays.@time foo4(a, b, c)
    CuArrays.@time foo4(a, b, c)
    return nothing
end
main()
On my machine the results are:
foo1 (plain @cuda):
1.372698 seconds (32.08 k CPU allocations: 1.639 MiB)
1.221025 seconds (42 CPU allocations: 1.219 KiB)
1.214232 seconds (42 CPU allocations: 1.219 KiB)
foo2 (threads=1024 blocks=3907):
0.002800 seconds (52 CPU allocations: 1.375 KiB)
0.002400 seconds (52 CPU allocations: 1.375 KiB)
0.002305 seconds (52 CPU allocations: 1.375 KiB)
foo3 (threads=640 blocks=6250):
0.097974 seconds (87.40 k CPU allocations: 4.605 MiB)
0.003157 seconds (96 CPU allocations: 3.281 KiB)
0.002521 seconds (96 CPU allocations: 3.281 KiB)
foo4 (threads=640 blocks=32):
0.015719 seconds (12.93 k CPU allocations: 666.858 KiB)
0.000456 seconds (47 CPU allocations: 1.297 KiB)
0.000401 seconds (47 CPU allocations: 1.297 KiB)
As I understand it, there are several approaches to choosing the number of blocks and threads for the @cuda macro:
- Like in function foo1, use the defaults of the @cuda macro.
- Like in function foo2, use the DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK attribute of the device. This approach does not work in all cases, since kernels that make extensive use of registers can hit a "too many resources requested" error.
- Like in function foo3, compute the number of threads with the launch_configuration function and derive the number of blocks from the length of the input arrays. This approach is suggested here: Synchronizing Cuda kernels - #3 by maleadt, and is used, e.g., in CuArrays (CuArrays.jl/indexing.jl at c58cabc3d82c2f882bd0dc2d29457e2f247a5ed9 · JuliaGPU/CuArrays.jl · GitHub).
- Like in function foo4, take both the number of threads and the number of blocks from launch_configuration.
The timings above suggest that the fastest run was achieved with approach 4. Still, since other packages use approach 3, I guess that in general it is better or safer (a sketch combining the two approaches is shown below).
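For reference, here is a minimal sketch of how I imagine approaches 3 and 4 could be combined: take the thread count suggested by the occupancy API, but cap the number of blocks at what is needed to cover the array. The covering_config helper and foo5 are my own names, not from any package:

function covering_config(kernel, n)
    # Ask the occupancy API for a suggestion, then avoid launching
    # more blocks than are needed to cover n elements.
    config = launch_configuration(kernel.fun)
    threads = min(n, config.threads)
    blocks = min(config.blocks, cld(n, threads))
    return (threads=threads, blocks=blocks)
end

function foo5(a, b, c)
    NN = length(a)
    @cuda config=(kernel -> covering_config(kernel, NN)) kernel(a, b, c)
    return nothing
end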
To sum up, my questions are:
- What is the most general approach to choosing the launch arguments for the @cuda macro?
- Apart from the threads and blocks arguments, which other arguments can significantly affect performance? Shared memory?
- Why doesn't the @cuda macro by default estimate the number of threads and blocks needed to maximize performance?
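Regarding the second question: as far as I can tell, @cuda also accepts shmem (bytes of dynamic shared memory per block) and stream keywords. A minimal sketch with arbitrary illustration values follows; note that this particular kernel does not use shared memory, so the shmem request is only for demonstration (a real kernel would access it via @cuDynamicSharedMem):

# Sketch: requesting dynamic shared memory and launching on an explicit stream.
# 1024 Float32 values per block is an arbitrary example size.
shmem_bytes = 1024 * sizeof(Float32)
stream = CuStream()
@cuda blocks=32 threads=256 shmem=shmem_bytes stream=stream kernel(a, b, c)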