I have a system with some kernels that are repeatedly called in a loop. My kernels operate on the same input data, but I know at launch time which kernels are safe to spawn asynchronously and which aren't. A MWE would look something like this:
using Pkg
pkg"activate --temp"
pkg"add KernelAbstractions, CUDA"
using KernelAbstractions, CUDA
CUDA.functional()
@kernel function kernel1(u, offset)
    I = @index(Global) + offset
    u[I] = u[I] + 1
end
@kernel function kernel2(u, offset)
    I = @index(Global) + offset
    u[I] = u[I] + 2
end
function loop(u)
    backend = get_backend(u)
    _kernel1 = kernel1(backend)
    _kernel1(u, 0; ndrange=1000)
    _kernel2 = kernel2(backend)
    _kernel2(u, 1000; ndrange=1000)
    KernelAbstractions.synchronize(backend)
    u
end
x = cu(ones(2000))
loop(x)
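For reference, here is a variant I have been considering. It is only a sketch, resting on my (possibly wrong) assumption that in CUDA.jl each Julia task gets its own CUDA stream, so launching each kernel from a separate task should let them run concurrently:

```julia
# Sketch, NOT verified to overlap: assumes CUDA.jl's task-local
# streams, i.e. each spawned task launches on its own stream.
function loop_async(u)
    backend = get_backend(u)
    @sync begin
        Threads.@spawn begin
            kernel1(backend)(u, 0; ndrange=1000)
            KernelAbstractions.synchronize(backend)  # wait on this task's stream
        end
        Threads.@spawn begin
            kernel2(backend)(u, 1000; ndrange=1000)
            KernelAbstractions.synchronize(backend)
        end
    end  # @sync waits for both tasks
    u
end
```

Is this the intended way to get concurrent launches, or is there a more direct stream-based API I should be using?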
When I inspect this code with nsys, it looks like the kernels are launched one after another rather than in parallel. Naively, I assumed a kernel launch is always asynchronous until you put a synchronize
in between. Is it possible to achieve the desired behavior?
My real-world example is more complex: I want to launch some kernels in parallel, wait for all of them to finish, and then launch a different set which depends on the results of the first set. It seems like CUDA graphs might do what I want, but I guess just launching two kernels in parallel would be a required first step.
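To make the two-phase structure concrete, this is roughly what I am after (a hypothetical sketch reusing the kernels from the MWE; the phase-1 concurrency again assumes task-local streams in CUDA.jl):

```julia
# Hypothetical two-phase pattern:
#   phase 1: independent kernels, launched concurrently (one task each)
#   barrier: wait for all of phase 1
#   phase 2: kernels that consume phase 1's results
function two_phase(u)
    backend = get_backend(u)
    # phase 1: these kernels touch disjoint regions of u, so they
    # are safe to run at the same time
    @sync for (k, off) in ((kernel1, 0), (kernel2, 1000))
        Threads.@spawn begin
            k(backend)(u, off; ndrange=1000)
            KernelAbstractions.synchronize(backend)
        end
    end
    # phase 2: runs only after everything in phase 1 has finished
    kernel1(backend)(u, 0; ndrange=2000)
    KernelAbstractions.synchronize(backend)
    u
end
```

If CUDA graphs are the better fit here, pointers on how to capture such a phase structure from KernelAbstractions launches would be appreciated.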