The NVIDIA Technical Blog post "GPU Pro Tip: CUDA 7 Streams Simplify Concurrency" (see the section "A Multi-threading Example") demonstrates how multiple CPU threads can be spawned, each with its own stream, so that their kernels run concurrently on the GPU. Ultimately, I’d like to have a multi-threaded program where each thread can launch kernels on its own stream and synchronize on (wait for) them.
I’m trying to replicate this by modifying the MWE from another active thread, "CUDA streams do not overlap". I’m just adding a Threads.@threads loop around the allocation, kernel launch, and synchronization.
My code and output are:
using CUDAdrv, CUDAnative, CuArrays

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return
end

function main()
    nx = 128*1024^2
    nt = 100
    nthreads = 1024
    nblocks = ceil(Int, nx/nthreads)
    Threads.@threads for i = 1:2
        A = CuArray(zeros(nx))
        B = CuArray(ones(nx))
        s = CuStream()
        @cuda blocks=nblocks threads=nthreads stream=s memcopy!(A, B);
        CUDAdrv.synchronize(s)
    end
end

main()
This produces the following error, but only when Threads.@threads is used:
Error thrown in threaded loop on thread 1: CUDAdrv.CuError(code=201, meta=nothing)
julia>
Without multiple threads, I have no problem. Is there a better way to do the multithreading, or a better way to use the CUDAdrv/CUDAnative packages from multiple threads?
Thanks!