Parallel launch of CUDA kernels

I have a system with some kernels which are repeatedly called in a loop. My kernels operate on the same input data, but I know at launch time which kernels are safe to spawn asynchronously and which aren't. An MWE would look something like this:

using Pkg
pkg"activate --temp"
pkg"add KernelAbstractions, CUDA"

using KernelAbstractions, CUDA
CUDA.functional()

@kernel function kernel1(u, offset)
    I = @index(Global) + offset
    u[I] = u[I] + 1
end

@kernel function kernel2(u, offset)
    I = @index(Global) + offset
    u[I] = u[I] + 2
end

function loop(u)
    backend = get_backend(u)

    # launch both kernels on disjoint halves of u
    _kernel1 = kernel1(backend)
    _kernel1(u, 0; ndrange=1000)

    _kernel2 = kernel2(backend)
    _kernel2(u, 1000; ndrange=1000)

    # wait for all launched work to finish
    KernelAbstractions.synchronize(backend)
    u
end

x = cu(ones(2000))
loop(x)

When I inspect this code with nsys, it looks like the kernels are launched one after another and not in parallel. Naively I assumed a kernel launch is always asynchronous until you put a synchronize in between. Is it possible to achieve the desired behavior?

My real-world example is more complex: I want to launch some kernels in parallel, wait for all of them to finish, and then launch a different set which depends on the results of the first. It seems like CUDA graphs might do what I want, but I guess just launching two kernels in parallel would be a required first step.

Kernel launches are asynchronous with respect to the host, but kernels on the same stream execute in order with respect to each other, and by default everything you launch from one task ends up on that task's stream.

You can use Julia tasks to model that concurrency.
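
For the MWE above, that could look something like the sketch below (untested; loop_concurrent is just an illustrative name). In CUDA.jl every Julia task uses its own task-local stream, so the two launches can overlap on the device, and since the kernels write to disjoint halves of u there is no race:

function loop_concurrent(u)
    backend = get_backend(u)
    _kernel1 = kernel1(backend)
    _kernel2 = kernel2(backend)

    @sync begin
        @async begin
            _kernel1(u, 0; ndrange=1000)             # this task's own stream
            KernelAbstractions.synchronize(backend)  # waits on this stream only
        end
        @async begin
            _kernel2(u, 1000; ndrange=1000)          # a different stream
            KernelAbstractions.synchronize(backend)
        end
    end
    u
end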

Wait, do you mean I can achieve on-device parallelism by launching the kernels from different Tasks on the host? That's my goal in the end: execute multiple kernels on the same data in parallel on the GPU…

Yeah, exactly. See Tasks and threads · CUDA.jl

Thanks for the link, this makes sense now! I guess alternatively I could create the needed number of CUDA streams manually and launch kernels on different streams from the same task/thread, with possibly less overhead?
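
Something like this sketch is what I have in mind, assuming KernelAbstractions launches on CUDA.jl's current task-local stream and stream! can be used to switch it temporarily (untested; loop_streams is just an illustrative name):

function loop_streams(u)
    backend = get_backend(u)
    _kernel1 = kernel1(backend)
    _kernel2 = kernel2(backend)

    s1, s2 = CuStream(), CuStream()

    # redirect the task-local stream for the duration of each launch,
    # so the two kernels end up on different streams
    CUDA.stream!(s1) do
        _kernel1(u, 0; ndrange=1000)
    end
    CUDA.stream!(s2) do
        _kernel2(u, 1000; ndrange=1000)
    end

    # wait for both streams to drain
    CUDA.synchronize(s1)
    CUDA.synchronize(s2)
    u
end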

Follow-up question: since my kernels are relatively small, I think it would be best to prepare them as a CUDA graph to reduce launch overhead. Assuming my (limited) experiments with CUDA.jl's graph execution are correct, this isn't possible using the @captured macro, because it seems to only capture a single stream into a linear graph A->B->C->… Is there an equivalent of the C API for building graphs manually?
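
For reference, this is the capture-and-replay pattern I mean, as a rough sketch following the @captured docstring (add_k! is just a stand-in kernel):

using CUDA

# plain CUDA.jl kernel: add k to n elements of u, starting at offset
function add_k!(u, k, offset, n)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= n
        @inbounds u[i + offset] += k
    end
    return nothing
end

u = CUDA.zeros(Float32, 2000)
# compile up front so no compilation happens during capture
k = @cuda launch=false add_k!(u, 1f0, 0, 1000)
for _ in 1:10
    # the first iteration records the launches into a graph, later
    # iterations replay it; both launches are captured on the same
    # stream, so the graph is the linear chain A -> B
    @captured begin
        k(u, 1f0, 0, 1000; threads=256, blocks=4)     # node A
        k(u, 2f0, 1000, 1000; threads=256, blocks=4)  # node B
    end
end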

Yes, check out the CUDA driver API, all of which is available in CUDA.jl; you can just call CUDA.cuGraphAddKernelNode_v2 directly. It would be nice to have higher-level abstractions though, so if you have any success here, consider creating a PR with the abstractions you come up with.
