Launch_configuration() equivalent for AMDGPU.jl

CUDA.jl has a great feature for sizing threads and blocks, namely launch_configuration(). I rarely size my kernels manually; instead I do something like:

kernel = @cuda launch=false myfunc(args...)  # compile, but don't launch yet
config = launch_configuration(kernel.fun)    # occupancy-based suggestion
threads = min(N, config.threads)             # N = total number of work items
blocks = cld(N, threads)
kernel(args...; threads, blocks)

It’s almost always very close to optimal, and allows my code to move from device to device without worrying too much about launch parameters.

However, from what I can tell ROCm doesn’t have this, and neither does AMDGPU.jl. As a downstream consequence, neither does KernelAbstractions.jl.

So I guess my question is: how should I be sizing my ROCm kernels in a way that is fairly close to optimal and will work across a range of different AMD devices?


This is a great question that I don't currently have a good answer to. As it stands, HIP has its own occupancy analysis which it uses to determine a good launch configuration, but since HIP is C++, we likely can't access this directly (even though the calls themselves are available from C, the arguments they expect are not things we generate in AMDGPU.jl).
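
For concreteness, the C entry point here is hipModuleOccupancyMaxPotentialBlockSize, which wants a hipFunction_t handle - exactly the thing we don't produce for Julia kernels. A hedged sketch of how it could be called from Julia if we did have such a handle (the library name and wrapper function below are illustrative assumptions, not AMDGPU.jl API):

```julia
# Sketch only: assumes libamdhip64 is loadable and `fun` is a valid
# hipFunction_t, which AMDGPU.jl does not currently produce.
const libhip = "libamdhip64"

function hip_launch_configuration(fun::Ptr{Cvoid}; shmem = 0, max_threads = 0)
    grid, block = Ref{Cint}(0), Ref{Cint}(0)
    # hipError_t hipModuleOccupancyMaxPotentialBlockSize(
    #     int* gridSize, int* blockSize, hipFunction_t f,
    #     size_t dynSharedMemPerBlk, int blockSizeLimit)
    err = ccall((:hipModuleOccupancyMaxPotentialBlockSize, libhip), Cint,
                (Ref{Cint}, Ref{Cint}, Ptr{Cvoid}, Csize_t, Cint),
                grid, block, fun, shmem, max_threads)
    err == 0 || error("HIP error $err")
    return (blocks = Int(grid[]), threads = Int(block[]))
end
```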

We can parse the compiler's output to determine what our resource usage looks like, and since HIP is open source (MIT-licensed), we can probably borrow its implementation. I've managed to find its source, so someone just needs to port it to AMDGPU.jl.
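
Until that port exists, the core of such an occupancy heuristic is just arithmetic over the kernel's resource usage (parsed from the compiler output) and the device's limits. A minimal sketch of that shape of calculation - all the limit and usage numbers below are illustrative assumptions, not values queried from a real device:

```julia
# Hedged sketch of occupancy-style launch sizing; not HIP's actual algorithm.
struct DeviceLimits
    max_threads_per_block::Int  # e.g. 1024
    regs_per_simd::Int          # VGPRs available per SIMD unit
    lds_per_workgroup::Int      # bytes of LDS (shared memory) per workgroup
end

struct KernelUsage
    vgprs::Int      # registers per work-item, parsed from compiler output
    lds_bytes::Int  # LDS bytes per workgroup
end

# Pick the largest block size (a multiple of the 64-wide wavefront) that the
# register and LDS budgets allow.
function suggest_threads(dev::DeviceLimits, use::KernelUsage; wavefront = 64)
    use.lds_bytes <= dev.lds_per_workgroup || error("kernel exceeds LDS limit")
    # How many wavefronts fit per SIMD given register pressure:
    waves_by_regs = max(1, dev.regs_per_simd ÷ max(1, use.vgprs * wavefront))
    threads = min(dev.max_threads_per_block, waves_by_regs * wavefront)
    return (threads ÷ wavefront) * wavefront  # round down to wavefront multiple
end

dev = DeviceLimits(1024, 32768, 65536)  # illustrative gfx9-like numbers
use = KernelUsage(64, 0)
N = 1_000_000
threads = suggest_threads(dev, use)
blocks = cld(N, threads)                # same pattern as the CUDA.jl snippet
```

The wavefront size matters here: RDNA devices execute 32-wide waves, so even this rough sketch would need to query that per-device rather than hard-code 64.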


@jpsamaroo Thanks for your reply, Julian - porting that is something I'm happy to take on. I'll be in touch.
