Parallel launch of CUDA kernels

I have a system with some kernels which are repeatedly called in a loop. My kernels operate on the same input data, but I know at launch time which kernels are safe to spawn asynchronously and which aren't. An MWE would look something like this:

using Pkg
pkg"activate --temp"
pkg"add KernelAbstractions, CUDA"

using KernelAbstractions, CUDA
CUDA.functional()

@kernel function kernel1(u, offset)
    I = @index(Global) + offset
    u[I] = u[I] + 1
end

@kernel function kernel2(u, offset)
    I = @index(Global) + offset
    u[I] = u[I] + 2
end

function loop(u)
    backend = get_backend(u)

    # launch both kernels on disjoint halves of u
    _kernel1 = kernel1(backend)
    _kernel1(u, 0; ndrange=1000)

    _kernel2 = kernel2(backend)
    _kernel2(u, 1000; ndrange=1000)

    # wait for all launched work to finish
    KernelAbstractions.synchronize(backend)
    u
end

x = cu(ones(2000))
loop(x)

When I inspect this code with nsys, it looks like the kernels are launched one after another and not in parallel. Naively I assumed a kernel launch is always asynchronous until you put a synchronize in between. Is it possible to achieve the desired behavior?

My real-world example is more complex: I want to launch some kernels in parallel, wait for all of them to finish, and then launch a different set which depends on the results of the first. It seems like CUDA graphs might do what I want, but I guess just launching two kernels in parallel would be a required first step.

Kernel launches are asynchronous with respect to the host, but kernels on the same stream execute in order with respect to each other, and by default everything you launch from one task ends up on that task's stream.

You can use Julia tasks to model that concurrency.
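
For the MWE above, that could look something like the sketch below (untested; loop_concurrent is just an illustrative name). In CUDA.jl every Julia task uses its own task-local stream, so the two launches can overlap on the device, and since the kernels write to disjoint halves of u there is no race:

function loop_concurrent(u)
    backend = get_backend(u)
    _kernel1 = kernel1(backend)
    _kernel2 = kernel2(backend)

    @sync begin
        @async begin
            _kernel1(u, 0; ndrange=1000)             # this task's own stream
            KernelAbstractions.synchronize(backend)  # waits on this stream only
        end
        @async begin
            _kernel2(u, 1000; ndrange=1000)          # a different stream
            KernelAbstractions.synchronize(backend)
        end
    end
    u
end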

Wait, do you mean I can achieve on-device parallelism by launching the kernels from different Tasks on the host? That's my goal in the end: execute multiple kernels on the same data in parallel on the GPU…

Yeah, exactly. See Tasks and threads · CUDA.jl

Thanks for the link, this makes sense now! I guess alternatively I could create the needed number of CUDA streams manually and launch kernels on different streams from the same task/thread, with possibly less overhead?
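
Something like this sketch is what I have in mind, assuming KernelAbstractions launches on CUDA.jl's current task-local stream and stream! can be used to switch it temporarily (untested; loop_streams is just an illustrative name):

function loop_streams(u)
    backend = get_backend(u)
    _kernel1 = kernel1(backend)
    _kernel2 = kernel2(backend)

    s1, s2 = CuStream(), CuStream()

    # redirect the task-local stream for the duration of each launch,
    # so the two kernels end up on different streams
    CUDA.stream!(s1) do
        _kernel1(u, 0; ndrange=1000)
    end
    CUDA.stream!(s2) do
        _kernel2(u, 1000; ndrange=1000)
    end

    # wait for both streams to drain
    CUDA.synchronize(s1)
    CUDA.synchronize(s2)
    u
end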

Follow-up question: since my kernels are relatively small, I think it would be best to prepare them as a CUDA graph to reduce launch overhead. Assuming my (limited) experiments with CUDA.jl's graph execution are correct, this isn't possible using the @captured macro, because it seems to only capture a single stream into a linear graph A->B->C->… Is there an equivalent of the C API for building graphs manually?
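
For reference, this is the capture-and-replay pattern I mean, as a rough sketch following the @captured docstring (add_k! is just a stand-in kernel):

using CUDA

# plain CUDA.jl kernel: add k to n elements of u, starting at offset
function add_k!(u, k, offset, n)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= n
        @inbounds u[i + offset] += k
    end
    return nothing
end

u = CUDA.zeros(Float32, 2000)
# compile up front so no compilation happens during capture
k = @cuda launch=false add_k!(u, 1f0, 0, 1000)
for _ in 1:10
    # the first iteration records the launches into a graph, later
    # iterations replay it; both launches are captured on the same
    # stream, so the graph is the linear chain A -> B
    @captured begin
        k(u, 1f0, 0, 1000; threads=256, blocks=4)     # node A
        k(u, 2f0, 1000, 1000; threads=256, blocks=4)  # node B
    end
end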

Yes, check out the CUDA driver API, all of which is available in CUDA.jl; you can just call CUDA.cuGraphAddKernelNode_v2 directly. It would be nice to have higher-level abstractions though, so if you have any success here, consider creating a PR with the abstractions you come up with.
