CUDA.jl calling kernels in parallel?

Hi, I’m trying to use CUDA.jl and CUBLAS to implement a mod N matrix multiplication using Algorithm 1 of this paper (see the HAL preprint). The inner loop of the algorithm alternates CUBLAS calls with a broadcast kernel (to reduce the entries of a matrix modulo N).
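For reference, here is a minimal sketch of that alternating pattern (the matrix sizes and modulus are made up for illustration): one CUBLAS multiply via `mul!`, followed by a broadcast that reduces the result mod N.

```julia
using CUDA, LinearAlgebra

N = 7919                                       # hypothetical modulus
A = floor.(CUDA.rand(Float64, 256, 256) .* N)  # random matrices with entries in [0, N)
B = floor.(CUDA.rand(Float64, 256, 256) .* N)
C = CUDA.zeros(Float64, 256, 256)

mul!(C, A, B)     # dispatches to CUBLAS gemm
C .= mod.(C, N)   # broadcast kernel: reduce entries modulo N
```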

When I run this on a single thread, it works and gives a nice speedup compared to existing CPU implementations.

However, if I try to run this on multiple threads, I get a bunch of lock conflicts, which seem to come from the fact that CUDA.jl kernels are always launched in sequence (see here, here).

However, as far as I can tell from what is written about CUDA C, CUDA streams seem to allow one to execute kernels in parallel? And the CUDA.jl docs say that each thread in a @threads macro is given its own CUDA stream.

So I have two questions: first, why does CUDA.jl need to have locks for kernel launches? Second, is there a standard way to overcome that in a situation like mine?

Only within a single task. Given that you’re using multiple threads, presumably via Julia tasks, each of those should get its own stream and thus allow concurrent execution.
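As a sketch of what that could look like (the sizes, modulus, and number of tasks are illustrative assumptions): each `Threads.@spawn` task gets its own stream in CUDA.jl, so the CUBLAS calls and broadcast kernels in different tasks can overlap on the GPU.

```julia
using CUDA, LinearAlgebra

N = 7919  # hypothetical modulus
inputs = [(floor.(CUDA.rand(Float64, 256, 256) .* N),
           floor.(CUDA.rand(Float64, 256, 256) .* N)) for _ in 1:4]

results = Vector{CuMatrix{Float64}}(undef, length(inputs))
@sync for (i, (A, B)) in enumerate(inputs)
    Threads.@spawn begin
        C = similar(A)
        mul!(C, A, B)     # CUBLAS gemm on this task's own stream
        C .= mod.(C, N)   # broadcast kernel, same stream
        results[i] = C
    end
end
```

Note that launches within one task still execute in order on that task's stream; the concurrency is across tasks.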

A lock is taken when looking up the function in the compilation cache, but that should be very quick, and it doesn’t encompass the actual launch. So the kernel launch itself doesn’t take a lock.