CUDA streams do not overlap

I am having trouble getting streams to overlap. So I created the following simple example, where the kernels launched on streams s1 and s2 should overlap, yet they don’t:

using CUDAdrv, CUDAnative, CuArrays

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return nothing
end

nx = 128*1024^2
nt = 100
A = cuzeros(nx);
B = cuones(nx);
C = cuzeros(nx);
D = cuones(nx);
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
s1 = CuStream(CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(CUDAdrv.STREAM_NON_BLOCKING);

for it = 1:nt
    @cuda blocks=nblocks threads=nthreads stream=s1 memcopy!(A, B);
    @cuda blocks=nblocks threads=nthreads stream=s2 memcopy!(C, D);
    CUDAdrv.synchronize()
end

Here is a screenshot from the analysis with nvvp:

What am I missing?

Thanks!

You might be saturating the GPU, in which case you won’t see any overlap. There are many reports like this online about kernels not overlapping; maybe start with a known working example before porting it to CUDAdrv/CUDAnative. Since the kernels are launched on independent streams, everything seems to be working from an API point of view.
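
To make the saturation argument concrete, here is a rough back-of-the-envelope sketch in plain Julia (no GPU needed). The launch configuration is the one from the original post; the hardware numbers `num_sms` and `max_threads_per_sm` are assumed placeholders, not values reported in this thread, so substitute the limits of your own device.

```julia
# Launch configuration taken from the original post.
nx       = 128 * 1024^2
nthreads = 1024
nblocks  = cld(nx, nthreads)          # blocks requested per kernel launch

# ASSUMED hardware limits (hypothetical placeholders; query your own GPU).
num_sms            = 56               # number of streaming multiprocessors
max_threads_per_sm = 2048             # maximum resident threads per SM

# Upper bound on blocks that can be resident on the whole GPU at once,
# considering only the thread limit (registers and shared memory ignored).
resident_blocks = num_sms * (max_threads_per_sm ÷ nthreads)

# Number of "waves" a single kernel needs to run all of its blocks.
waves = cld(nblocks, resident_blocks)

println("blocks per launch:        ", nblocks)
println("resident blocks (bound):  ", resident_blocks)
println("waves per kernel:         ", waves)
```

Under these assumptions a single launch already requests far more blocks than can be resident at once, so the first kernel keeps every SM full and the second kernel could at best start once the first one is in its final wave.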

Thanks, I will investigate this, starting from CUDA C examples…

You don’t necessarily need to start from CUDA C, but from a set of kernels and a corresponding launch configuration that you can overlap on your GPU. If any of your kernels exhausts a resource, it’s impossible to overlap.
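
As an illustration of such a launch configuration, here is a minimal sketch that reuses only the calls from the original post, plus the `gridDim()` intrinsic. The kernel name `memcopy_strided!` and the grid size of 8 blocks with 256 threads are hypothetical choices, picked so that both kernels can plausibly be resident at the same time. Whether they actually overlap depends on the GPU, and since both kernels are memory-bound they still share bandwidth.

```julia
using CUDAdrv, CUDAnative, CuArrays

# Grid-stride copy: each thread handles several elements, so a small grid
# can cover the whole array without occupying every SM.
function memcopy_strided!(A, B)
    ix     = (blockIdx().x-1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    while ix <= length(A)
        @inbounds A[ix] = B[ix]
        ix += stride
    end
    return nothing
end

nx = 128*1024^2
A = cuzeros(nx);
B = cuones(nx);
C = cuzeros(nx);
D = cuones(nx);

# ASSUMED small launch configuration (hypothetical values; tune per device).
nthreads = 256
nblocks  = 8

s1 = CuStream(CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(CUDAdrv.STREAM_NON_BLOCKING);

# Each kernel leaves most of the GPU free, so the two launches may run
# concurrently; check the timeline in nvvp to confirm.
@cuda blocks=nblocks threads=nthreads stream=s1 memcopy_strided!(A, B);
@cuda blocks=nblocks threads=nthreads stream=s2 memcopy_strided!(C, D);
CUDAdrv.synchronize()
```

If the two kernels show up side by side in the profiler with this configuration, the grid size can be increased step by step to find the point at which one kernel starts to crowd out the other.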