CUDA streams do not overlap

I am having trouble making streams overlap. So, I created the following simple example, in which the streams s1 and s2 should overlap; yet they don’t:

using CUDAdrv, CUDAnative, CuArrays

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return nothing
end

nx = 128*1024^2
nt = 100
A = cuzeros(nx);
B = cuones(nx);
C = cuzeros(nx);
D = cuones(nx);
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
s1 = CuStream(CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(CUDAdrv.STREAM_NON_BLOCKING);

for it = 1:nt
    @cuda blocks=nblocks threads=nthreads stream=s1 memcopy!(A, B);
    @cuda blocks=nblocks threads=nthreads stream=s2 memcopy!(C, D);
    CUDAdrv.synchronize()
end

Here is a screenshot from the analysis with nvvp:

What am I missing?

Thanks!


You might be saturating the GPU; you won’t see any overlap then. There are many reports like this online about kernels not overlapping; maybe start with a known working example before porting it to CUDAdrv/CUDAnative. Seeing how the kernels are launched on independent streams, everything seems to be working from an API point of view.

Thanks, I will therefore investigate this, starting from CUDA C examples…

You don’t necessarily need to start from CUDA C, but rather from a set of kernels and a corresponding launch configuration that you can overlap on your GPU. If any of your kernels exhausts a resource, it’s impossible to overlap.
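For example, a grid-stride variant of the memcopy! kernel above, launched with only a handful of blocks, gives such a configuration (just a sketch; memcopy_gridstride! and the launch numbers are arbitrary illustrations): each launch still covers the whole array but occupies only a few SMs, so the two streams have room to run concurrently.

using CUDAdrv, CUDAnative, CuArrays

# grid-stride variant of memcopy!: each thread copies several elements,
# so a launch with few blocks still covers the whole array without
# saturating the GPU
function memcopy_gridstride!(A, B)
    ix     = (blockIdx().x-1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while ix <= length(A)
        A[ix] = B[ix]
        ix += stride
    end
    return nothing
end

nx = 128*1024^2
A = cuzeros(nx); B = cuones(nx);
C = cuzeros(nx); D = cuones(nx);
s1 = CuStream(CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(CUDAdrv.STREAM_NON_BLOCKING);

# 8 blocks of 256 threads should not fill the GPU on their own,
# leaving resources free for the second launch
@cuda blocks=8 threads=256 stream=s1 memcopy_gridstride!(A, B)
@cuda blocks=8 threads=256 stream=s2 memcopy_gridstride!(C, D)
CUDAdrv.synchronize()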

In order to understand the issue, I did some experiments with CUDA C.

Experiment 1

I took one of my working (MPI-)CUDA C applications, which features overlapping streams, and simplified it. Now, there is just one compute kernel (green stream) and one kernel that copies the boundaries of a 3-D array into a buffer for a halo update (violet stream):

Note
The streams were created with different priorities (see the CUDA C function calls at the end [1]):

  • violet: highest priority
  • green: lowest priority

How can we explain the nvvp screenshot?

  1. The computation kernel is submitted and executed (green stream).
  2. The boundary copy kernel is submitted (violet stream); as it has a higher priority than the green stream, it is executed immediately, thus overlapping the green stream.

This is the desired behaviour, as in the full application it allows the halo update to progress quickly while the computations (green stream) are ongoing.

Experiments 2–5

Then, I made four more experiments. I changed the priorities to have

  • violet: lowest priority
  • green: highest priority

or

  • violet: highest priority
  • green: highest priority

or

  • violet: lowest priority
  • green: lowest priority

or simply:

  • violet: default priority
  • green: default priority

For all these experiments, the nvvp screenshot then looked as follows:

How can we explain the nvvp screenshot?

  1. The computation kernel is submitted and executed (green stream).
  2. The boundary copy kernel is submitted (violet stream); as its priority is the same as or lower than that of the green stream, it is not executed immediately, because the green stream already saturates the GPU resources. It is only executed shortly before the end of the green stream, when the GPU resource saturation is winding down.

Summary

  1. The boundary copy kernel (violet stream) needs to be of higher priority than the computation kernel (green stream) in order to be executed immediately.
  2. The total execution time for both kernels is about the same for each experiment (5.93 ms with full overlap, versus 5.87 ms + 0.06 ms ≈ 5.93 ms when serialized; see [2]). When kernels saturate the resources, forcing them to overlap with priorities logically does not reduce the runtime (while the violet stream is executed, the green stream is probably just on hold…).

Conclusion

It is clear that I do not want to “soften” my compute kernel (make it saturate the GPU less, at the cost of being slower) just to overlap it with the boundary copy kernel without having to resort to stream priorities. Using stream priorities seems to be exactly the right approach for my use case. Thus, I just need to use stream priorities in the Julia code as well (I created a separate topic for that here).

@maleadt, so you are certainly right that the Julia kernels above do not overlap, as each of them saturates the GPU resources. I believe the above experiments showed that this is not a problem: if one needs to overlap a small kernel with a kernel that saturates the GPU, one can use stream priorities and obtain exactly the behaviour one would expect.

Additional information

[1] CUDA C function calls to create highest and lowest priority streams:

cudaStream_t streams[2];
int leastPriority=-1, greatestPriority=-1;
// note: a numerically lower value means a higher priority, so greatestPriority <= leastPriority
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
cudaStreamCreateWithPriority(&streams[0], cudaStreamNonBlocking, greatestPriority);  // highest priority
cudaStreamCreateWithPriority(&streams[1], cudaStreamNonBlocking, leastPriority);     // lowest priority

[2] timing:
experiment 1 (full overlap):
green stream: 5.93 ms
(violet stream: 0.06 ms)

experiments 2 and 3:
green stream: 5.87 ms
violet stream: 0.06 ms
total execution time: 5.93 ms

Interesting! Thanks for digging into this.

Reproducing the CUDA C experiments with Julia

With the stream priority feature, we can now reproduce the CUDA C experiments from above in Julia…

So the code becomes:

using CUDAdrv, CUDAnative, CuArrays

function CUDAdrv.CuStream(priority::Integer, flags::CUDAdrv.CUstream_flags=CUDAdrv.STREAM_DEFAULT)
    handle_ref = Ref{CUDAdrv.CuStream_t}()
    CUDAdrv.@apicall(:cuStreamCreateWithPriority, (Ptr{CUDAdrv.CuStream_t}, Cuint, Cint),
                                                  handle_ref, flags, priority)

    ctx = CuCurrentContext()
    obj = CuStream(handle_ref[], ctx)
    finalizer(CUDAdrv.unsafe_destroy!, obj)
    return obj
end

# (least, greatest) stream priority of the current context; in CUDA, a numerically lower value means a higher priority
priorityRange() = (r1_ref = Ref{Cint}(); r2_ref = Ref{Cint}(); CUDAdrv.@apicall(:cuCtxGetStreamPriorityRange, (Ptr{Cint}, Ptr{Cint}), r1_ref, r2_ref); (r1_ref[], r2_ref[]))

# priority of a given stream
priority(s::CuStream) = (prio_ref = Ref{Cint}(); CUDAdrv.@apicall(:cuStreamGetPriority, (CUDAdrv.CuStream_t, Ptr{Cint}), s, prio_ref); prio_ref[])

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return nothing
end

nx = 128*1024^2
nt = 100
A = cuzeros(nx);
B = cuones(nx);
C = cuzeros(nx);
D = cuones(nx);
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
p_min, p_max = priorityRange();   # (least, greatest) priority; the greatest priority is the numerically smallest value
s1 = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING);   # lowest priority
s2 = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING);   # highest priority
priority(s1)
priority(s2)

for it = 1:nt
    @cuda blocks=nblocks threads=nthreads stream=s1 memcopy!(A, B);
    @cuda blocks=nblocks threads=nthreads stream=s2 memcopy!(C, D);
    CUDAdrv.synchronize()
end
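As a quick cross-check without nvvp (just a sketch, not part of the nvvp analysis), the loop can also be timed on the host; the synchronize() at the end of each iteration makes wall-clock timing meaningful. As in the CUDA C experiments, the time per iteration should be about the same for both priority orderings; only the scheduling within an iteration changes.

t0 = Base.time()
for it = 1:nt
    @cuda blocks=nblocks threads=nthreads stream=s1 memcopy!(A, B);
    @cuda blocks=nblocks threads=nthreads stream=s2 memcopy!(C, D);
    CUDAdrv.synchronize()
end
println("time per iteration: ", (Base.time() - t0)/nt*1000, " ms")
@assert all(Array(A) .== 1) && all(Array(C) .== 1)   # verify that both copies completed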

Using a higher priority for the second stream than for the first stream makes the streams overlap:

Using, however, a lower priority for the second stream

s1 = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING); 

makes the second stream start only when the first stream is nearly finished (i.e. when it starts to use fewer GPU resources):

Conclusion

As in CUDA C, stream priorities make it possible to overlap streams that saturate the GPU resources. The total runtime is not reduced by this, but it enables, e.g., quickly copying some array boundaries for a halo update while a computation kernel is being executed (my use case for streams).
Streams in CuArrays/CUDAnative/CUDAdrv were observed to behave as expected, i.e. as in CUDA C.
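To connect this back to the halo-update use case, here is a minimal sketch (the kernel copy_boundary!, the buffer buf and the stream names are hypothetical illustrations; memcopy!, A, B, nblocks, nthreads, p_min and p_max are reused from the code above): a large kernel runs on the lowest-priority stream while a small boundary-copy kernel on the highest-priority stream overlaps it.

# hypothetical boundary-copy kernel: copies only the first nb elements of A
# (standing in for the halo of a 3-D array) into a buffer
function copy_boundary!(buf, A, nb)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if ix <= nb
        buf[ix] = A[ix]
    end
    return nothing
end

nb  = 1024                                                  # arbitrary boundary size
buf = cuzeros(nb)

s_compute  = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING)   # lowest priority
s_boundary = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING)   # highest priority

# the large memcopy! stands in for the compute kernel; the small boundary
# copy overlaps it thanks to its higher stream priority
@cuda blocks=nblocks threads=nthreads stream=s_compute memcopy!(A, B)
@cuda blocks=cld(nb, 256) threads=256 stream=s_boundary copy_boundary!(buf, A, nb)
CUDAdrv.synchronize()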
