CUDA streams do not overlap

In order to understand the issue, I did some experiments with CUDA C.

Experiment 1

I took one of my working (MPI-) CUDA C application which features overlapping streams and simplified it. Now, there is just one compute kernel (green stream) and one kernel that copies boundaries of a 3-D array into a buffer for a halo update (violet stream):

Note
The streams were created with different priorities (see the CUDA C function calls at the end [1]):

  • violet: highest priority
  • green: lowest priority

How can we explain the nvvp screenshot?

  1. The computation kernel is submitted and executed (green stream).
  2. The boundary copy kernel is submitted (violet stream); as it is higher priority than the green stream, it is immediately executed, thus, overlapping the green stream.

This is the desired behaviour, as in the full application it allows to quickly progress with the halo update while the computations (green stream) are ongoing.

Experiment 2 - 5

Then, I made four more experiments. I changed the priorities to have

  • violet: lowest priority
  • green: highest priority

or

  • violet: highest priority
  • green: highest priority

or

  • violet: lowest priority
  • green: lowest priority

or simply:

  • violet: default priority
  • green: default priority

For all these experiments the nvvp screenshot looked then as follows:

How can we explain the nvvp screenshot?

  1. The computation kernel is submitted and executed (green stream).
  2. The boundary copy kernel is submitted (violet stream); as it is the same / lower priority than the green stream, it is however not immediately executed as the green stream already saturates the GPU resources. It is only executed *shortly before the end of the green stream when the GPU resource saturation is winding down.

Summary

  1. The boundary copy kernel (violet stream) needs to be of higher priority than the computation kernel (green stream) in order to be executed immediately.
  2. The total execution time for both kernels is about the same for each experiment [2]. When kernels saturate the resources, forcing them to overlap with priorities does logically not reduce the runtime (while the violet stream is executed, the green stream is probably just on hold…).

Conclusion

It is clear that I do not want to “soften” my compute kernel (make it saturate the GPU less, at the cost of being slower) just to overlap it with the boundary copy kernel without needing to recur to stream priorities. Using stream priorities seems to be absolutely the right way to do it for my use case. Thus, I just need to use stream priorities also in the Julia code (I created a separate topic for that here).

@maleadt, so you are certainly right that the Julia kernels above do not overlap as each of them saturates the GPU resources. Now, I believe the above experiments showed that this is no problem. If one needs to overlap a small kernel with a kernel that saturates the GPU, then one can use the stream priorities for that and obtains exactly what one should have.

Additional information

[1] CUDA C function calls to create highest and lowest priority streams:

cudaStream_t streams[2];
int leastPriority=-1, greatestPriority=-1;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
cudaStreamCreateWithPriority(&streams[0], cudaStreamNonBlocking, greatestPriority);
cudaStreamCreateWithPriority(&streams[1], cudaStreamNonBlocking, leastPriority);

[2] timing:
experiment 1 (full overlap):
green stream: 5.93 ms
(violet stream: 0.06 ms)

experiments 2 and 3:
green stream: 5.87 ms
violet stream: 0.06 ms
total execution time: 5.93 ms