Hi, I’m trying to use CUDA.jl and CUBLAS to implement matrix multiplication mod N, following Algorithm 1 of this paper (see the HAL preprint). The inner loop of the algorithm alternates CUBLAS calls with a broadcast kernel call (to reduce the entries of a matrix modulo N).
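To make the inner loop concrete, here is a minimal sketch of what one iteration looks like in my code (names like `mulmod!` are mine, not from the paper; `C`, `A`, `B` would be `CuArray`s in the real version, and the broadcast form of the mod reduction may differ from the paper's kernel):

```julia
using LinearAlgebra  # for mul!

# One inner-loop step: a GEMM (dispatched to CUBLAS for CuArrays)
# followed by an elementwise reduction mod N (a broadcast kernel on the GPU).
function mulmod!(C, A, B, N)
    mul!(C, A, B)       # C = A * B  (CUBLAS gemm! when the arrays live on the GPU)
    C .= mod.(C, N)     # broadcast kernel: reduce every entry modulo N
    return C
end
```

The same function runs on plain CPU `Array`s, which is how I check correctness before moving to the GPU.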
When I run this on a single thread, it works and gives a nice speedup compared to existing CPU implementations.
However, if I try to run this on multiple threads, I hit a bunch of lock conflicts, which seem to come from the fact that CUDA.jl kernel launches are serialized behind a lock (see here, here).
However, as far as I can tell from what’s written about CUDA C, CUDA streams are supposed to allow kernels to execute concurrently. And the CUDA.jl docs say that each thread in a `@threads` block is given its own CUDA stream.
So I have two questions: first, why does CUDA.jl need to hold a lock around kernel launches? Second, is there a standard way to work around that in a situation like mine?