CUDA.jl has a great feature for sizing threads and blocks, namely
launch_configuration(). I rarely size my kernels manually; instead I do something like:
```julia
kernel = @cuda launch=false myfunc(args...)
config = launch_configuration(kernel.fun)
threads = min(N, config.threads)
blocks = cld(N, threads)
kernel(args...; threads, blocks)
```
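(Here `min(N, config.threads)` avoids launching more threads than there are elements when N is small, and `cld` rounds the block count up so the whole range is covered.)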
The result is almost always very close to optimal, and it lets my code move from device to device without my having to worry much about launch parameters.
However, from what I can tell, ROCm doesn’t expose an equivalent, and neither does AMDGPU.jl. As a downstream consequence, neither does KernelAbstractions.jl.
So I guess my question is: how should I be sizing my ROCm kernels in a way that is fairly close to optimal and will work across a range of different AMD devices?
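For concreteness, here’s the kind of static sizing I’d otherwise fall back to. This is just a sketch with a made-up element-wise kernel; I’m also assuming the convention where `gridsize` counts workgroups (in older AMDGPU.jl versions it was the total number of work-items):

```julia
using AMDGPU

# Hypothetical element-wise kernel, just for illustration.
function scale!(y, x)
    i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
    if i <= length(y)
        @inbounds y[i] = 2f0 * x[i]
    end
    return
end

N = 2^20
x = AMDGPU.rand(Float32, N)
y = similar(x)

# Wavefronts are 64 wide on GCN/CDNA (32 on RDNA), so pick a
# groupsize that is a multiple of 64; 256 is a common default.
groupsize = min(N, 256)
gridsize = cld(N, groupsize)  # assumes gridsize counts workgroups
@roc groupsize=groupsize gridsize=gridsize scale!(y, x)
```

A fixed 256 is a reasonable lowest common denominator, but it obviously ignores per-kernel register and LDS pressure, which is exactly what launch_configuration() accounts for on the CUDA side.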