CUDA.jl calling kernels in parallel?

Hi, I’m trying to use CUDA.jl and CUBLAS to implement a mod N matrix multiplication using Algorithm 1 of this paper (see the HAL preprint). The inner loop of the algorithm alternates CUBLAS calls with a broadcast kernel (to reduce the entries of a matrix modulo N).
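For reference, here is a minimal sketch of that alternating pattern (the matrix sizes and modulus are made up for illustration): one CUBLAS multiply via `mul!`, followed by a broadcast that reduces the result mod N.

```julia
using CUDA, LinearAlgebra

N = 7919                                       # hypothetical modulus
A = floor.(CUDA.rand(Float64, 256, 256) .* N)  # random matrices with entries in [0, N)
B = floor.(CUDA.rand(Float64, 256, 256) .* N)
C = CUDA.zeros(Float64, 256, 256)

mul!(C, A, B)     # dispatches to CUBLAS gemm
C .= mod.(C, N)   # broadcast kernel: reduce entries modulo N
```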

When I run this on a single thread, it works and gives a nice speedup compared to existing CPU implementations.

However, if I try to run this on multiple threads, I get a bunch of lock conflicts, which seem to come from the fact that CUDA.jl kernels are always launched in sequence (see here, here).

However, as far as I can tell from what is written about CUDA C, CUDA streams seem to allow one to execute kernels in parallel? And the CUDA.jl docs say that each thread in a @threads macro is given its own CUDA stream.

So I have two questions: first, why does CUDA.jl need to have locks for kernel launches? Second, is there a standard way to overcome that in a situation like mine?

Only within a single task. Given that you’re using multiple threads, presumably via Julia tasks, each of those should get its own stream and thus allow concurrent execution.
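As a sketch of what that could look like (the sizes, modulus, and number of tasks are illustrative assumptions): each `Threads.@spawn` task gets its own stream in CUDA.jl, so the CUBLAS calls and broadcast kernels in different tasks can overlap on the GPU.

```julia
using CUDA, LinearAlgebra

N = 7919  # hypothetical modulus
inputs = [(floor.(CUDA.rand(Float64, 256, 256) .* N),
           floor.(CUDA.rand(Float64, 256, 256) .* N)) for _ in 1:4]

results = Vector{CuMatrix{Float64}}(undef, length(inputs))
@sync for (i, (A, B)) in enumerate(inputs)
    Threads.@spawn begin
        C = similar(A)
        mul!(C, A, B)     # CUBLAS gemm on this task's own stream
        C .= mod.(C, N)   # broadcast kernel, same stream
        results[i] = C
    end
end
```

Note that launches within one task still execute in order on that task's stream; the concurrency is across tasks.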

A lock is taken when looking up the function in the compilation cache, but that should be very quick, and it doesn’t encompass the actual launch. So the kernel launch itself doesn’t take a lock.