Hi, I’m trying to use CUDA.jl and CUBLAS to implement matrix multiplication mod N, following Algorithm 1 of this paper (see the HAL preprint). The inner loop of the algorithm alternates CUBLAS calls with a broadcast kernel call (to reduce the entries of a matrix modulo N).
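To make the inner loop concrete, here is a minimal sketch of what one iteration looks like in my code (names like `mulmod!` are mine, not from the paper; `C`, `A`, `B` would be `CuArray`s in the real version, and the broadcast form of the mod reduction may differ from the paper's kernel):

```julia
using LinearAlgebra  # for mul!

# One inner-loop step: a GEMM (dispatched to CUBLAS for CuArrays)
# followed by an elementwise reduction mod N (a broadcast kernel on the GPU).
function mulmod!(C, A, B, N)
    mul!(C, A, B)       # C = A * B  (CUBLAS gemm! when the arrays live on the GPU)
    C .= mod.(C, N)     # broadcast kernel: reduce every entry modulo N
    return C
end
```

The same function runs on plain CPU `Array`s, which is how I check correctness before moving to the GPU.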
When I run this on a single thread, it works and gives a nice speedup compared to existing CPU implementations.
However, if I try to run this on multiple threads, I hit a bunch of lock conflicts, which seem to come from the fact that CUDA.jl kernel launches are serialized behind a lock (see here, here).
However, as far as I can tell from what’s written about CUDA C, CUDA streams are supposed to allow kernels to execute concurrently. And the CUDA.jl docs say that each thread in a `@threads` block is given its own CUDA stream.
So I have two questions: first, why does CUDA.jl need to hold a lock around kernel launches? Second, is there a standard way to work around that in a situation like mine?