This CUDA tutorial demonstrates how multiple CPU threads can be spawned, each with its own stream, which achieves good concurrency on the GPU (see the section "A Multi-threading Example"): https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/ . Ultimately, I’d like to have a multi-threaded program where each thread can launch kernels on its own stream and sync/wait on them.
I’m trying to replicate this by modifying the MWE from this other active thread, "CUDA streams do not overlap". I’m just adding
My code and output are:
using CUDAdrv, CUDAnative, CuArrays

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return
end

function main()
    nx = 128*1024^2
    nt = 100
    nthreads = 1024
    nblocks = ceil(Int, nx/nthreads)
    Threads.@threads for i = 1:2
        A = CuArray(zeros(nx))
        B = CuArray(ones(nx))
        s = CuStream()
        @cuda blocks=nblocks threads=nthreads stream=s memcopy!(A, B);
        CUDAdrv.synchronize(s)
    end
end

main()
Which, only with the
Error thrown in threaded loop on thread 1: CUDAdrv.CuError(code=201, meta=nothing)
julia>
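For what it's worth, code 201 is `CUDA_ERROR_INVALID_CONTEXT` in the CUDA driver API, and a CUDA context is made current per CPU thread, so my guess is that the extra Julia threads have no active context. Below is an untested sketch of what I imagine the workaround would look like. It assumes that `CuCurrentContext()` returns the context already initialized on the main thread and that `CUDAdrv.activate(ctx)` binds it to the calling thread; I have not confirmed either against the package sources. The small warm-up allocation and the reduced `nx` are my own additions for illustration.

```julia
using CUDAdrv, CUDAnative, CuArrays

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return
end

function main()
    nx = 1024^2                     # smaller than the MWE, just for a quick check
    nthreads = 1024
    nblocks = ceil(Int, nx/nthreads)

    # Assumption: any GPU operation on the main thread initializes the context.
    CuArray(zeros(1))
    ctx = CuCurrentContext()        # assumed to return the main thread's context

    Threads.@threads for i = 1:2
        # Assumption: binds the shared context to this CPU thread.
        CUDAdrv.activate(ctx)

        A = CuArray(zeros(nx))
        B = CuArray(ones(nx))
        s = CuStream()
        @cuda blocks=nblocks threads=nthreads stream=s memcopy!(A, B)
        CUDAdrv.synchronize(s)
    end
end

main()
```

Even if this removes the error, I don't know whether the rest of the stack (for example the CuArrays allocator) is safe to use from several threads at once.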
Without multiple threads, I have no problem. Is there a better way I should be approaching either the multithreading or the use of the CUDAdrv/CUDAnative packages?