I am having trouble getting streams to overlap. I created the following simple example, in which the
streams s1 and s2 should overlap, yet they don’t:
    using CUDAdrv, CUDAnative, CuArrays

    function memcopy!(A, B)
        ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
        A[ix] = B[ix]
        return nothing
    end

    nx = 128*1024^2
    nt = 100
    A = cuzeros(nx);
    B = cuones(nx);
    C = cuzeros(nx);
    D = cuones(nx);
    nthreads = 1024
    nblocks = ceil(Int, nx/nthreads)
    s1 = CuStream(CUDAdrv.STREAM_NON_BLOCKING);
    s2 = CuStream(CUDAdrv.STREAM_NON_BLOCKING);

    for it = 1:nt
        @cuda blocks=nblocks threads=nthreads stream=s1 memcopy!(A, B);
        @cuda blocks=nblocks threads=nthreads stream=s2 memcopy!(C, D);
        CUDAdrv.synchronize()
    end
Here is a screenshot from the analysis with nvvp:
What am I missing?