CUDA streams do not overlap

Reproducing the CUDA C experiments with Julia

With the stream priority feature, we can now reproduce the CUDA C experiments from above in Julia.

So the code becomes:

using CUDAdrv, CUDAnative, CuArrays

function CUDAdrv.CuStream(priority::Integer, flags::CUDAdrv.CUstream_flags=CUDAdrv.STREAM_DEFAULT)
    handle_ref = Ref{CUDAdrv.CuStream_t}()
    CUDAdrv.@apicall(:cuStreamCreateWithPriority, (Ptr{CUDAdrv.CuStream_t}, Cuint, Cint),
                                                   handle_ref, flags, priority)

    ctx = CuCurrentContext()
    obj = CuStream(handle_ref[], ctx)
    finalizer(CUDAdrv.unsafe_destroy!, obj)
    return obj
end

function priorityRange()
    least_ref    = Ref{Cint}()
    greatest_ref = Ref{Cint}()
    CUDAdrv.@apicall(:cuCtxGetStreamPriorityRange, (Ptr{Cint}, Ptr{Cint}),
                     least_ref, greatest_ref)
    return (least_ref[], greatest_ref[])
end

function priority(s::CuStream)
    prio_ref = Ref{Cint}()
    CUDAdrv.@apicall(:cuStreamGetPriority, (CUDAdrv.CuStream_t, Ptr{Cint}), s, prio_ref)
    return prio_ref[]
end
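Note that in the CUDA API, a numerically *lower* value means a *higher* priority: `cuCtxGetStreamPriorityRange` returns the least priority first and the greatest priority second, so the "max" priority is the smaller number. A minimal sketch of checking this (requires a GPU and an active context; the variable names are illustrative):

```julia
# cuCtxGetStreamPriorityRange returns (leastPriority, greatestPriority).
# The greatest priority is numerically <= the least priority, e.g. (0, -1).
p_least, p_greatest = priorityRange()
@assert p_greatest <= p_least  # higher priority = smaller number
```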

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return nothing
end
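The kernel above has no bounds check, which is safe here because `nx = 128*1024^2` is an exact multiple of the 1024-thread block size. For array lengths that do not divide evenly, a bounds-checked variant would be needed; a sketch (the name `memcopy_checked!` is mine, not from the original code):

```julia
# Hypothetical bounds-checked variant of memcopy!: only required when nx
# is not an exact multiple of the block size (in this post it is, so the
# original kernel is safe as written).
function memcopy_checked!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if ix <= length(A)
        @inbounds A[ix] = B[ix]
    end
    return nothing
end
```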

nx = 128*1024^2
nt = 100
A = cuzeros(nx);
B = cuones(nx);
C = cuzeros(nx);
D = cuones(nx);
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
p_min, p_max = priorityRange();
s1 = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING); 
priority(s1)
priority(s2)

for it = 1:nt
    @cuda blocks=nblocks threads=nthreads stream=s1 memcopy!(A, B);
    @cuda blocks=nblocks threads=nthreads stream=s2 memcopy!(C, D);
    CUDAdrv.synchronize()
end

Using a higher priority for the second stream than for the first stream makes the streams overlap:

Using, however, a lower priority for the second stream

s1 = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING); 

makes the second stream start only when the first stream is nearly finished (i.e., when it starts to use fewer GPU resources):

Conclusion

As in CUDA C, stream priorities make it possible to overlap streams even when they saturate the GPU's resources. This does not reduce the total runtime, but it allows, e.g., quickly copying some array boundaries for a halo update while a computation kernel is being executed (my use case for streams).
Streams in CuArrays/CUDAnative/CUDAdrv were thus observed to behave as in CUDA C.
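The halo-update use case mentioned above could be sketched as follows, reusing the priority-stream constructor from this post. The kernel names `compute!` and `copy_boundary!` and the halo sizes are purely illustrative assumptions, not part of the original experiments:

```julia
# Hypothetical sketch: a high-priority stream copies boundary data for a
# halo update while the main computation runs on a low-priority stream.
s_compute = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING)  # low priority
s_halo    = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING)  # high priority

# Launch the (illustrative) compute kernel on the low-priority stream...
@cuda blocks=nblocks threads=nthreads stream=s_compute compute!(A)
# ...and let the boundary copy preempt it on the high-priority stream.
@cuda blocks=1 threads=nthreads stream=s_halo copy_boundary!(halo_buf, A)
CUDAdrv.synchronize()
```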
