CUDA streams do not overlap

Reproducing the CUDA C experiments with Julia

With the stream priority feature, we can now reproduce the CUDA C experiments from above in Julia.

So the code becomes:

using CUDAdrv, CUDAnative, CuArrays

function CUDAdrv.CuStream(priority::Integer, flags::CUDAdrv.CUstream_flags=CUDAdrv.STREAM_DEFAULT)
    handle_ref = Ref{CUDAdrv.CuStream_t}()
    CUDAdrv.@apicall(:cuStreamCreateWithPriority, (Ptr{CUDAdrv.CuStream_t}, Cuint, Cint),
                                                   handle_ref, flags, priority)

    ctx = CuCurrentContext()
    obj = CuStream(handle_ref[], ctx)
    finalizer(CUDAdrv.unsafe_destroy!, obj)
    return obj
end

function priorityRange()
    least_ref    = Ref{Cint}()
    greatest_ref = Ref{Cint}()
    CUDAdrv.@apicall(:cuCtxGetStreamPriorityRange, (Ptr{Cint}, Ptr{Cint}),
                     least_ref, greatest_ref)
    return (least_ref[], greatest_ref[])
end

function priority(s::CuStream)
    prio_ref = Ref{Cint}()
    CUDAdrv.@apicall(:cuStreamGetPriority, (CUDAdrv.CuStream_t, Ptr{Cint}), s, prio_ref)
    return prio_ref[]
end
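Note that in the CUDA API, a numerically *lower* value means a *higher* priority: `cuCtxGetStreamPriorityRange` returns the least priority first and the greatest priority second, so the "max" priority is the smaller number. A minimal sketch of checking this (requires a GPU and an active context; the variable names are illustrative):

```julia
# cuCtxGetStreamPriorityRange returns (leastPriority, greatestPriority).
# The greatest priority is numerically <= the least priority, e.g. (0, -1).
p_least, p_greatest = priorityRange()
@assert p_greatest <= p_least  # higher priority = smaller number
```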

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return nothing
end
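The kernel above has no bounds check, which is safe here because `nx = 128*1024^2` is an exact multiple of the 1024-thread block size. For array lengths that do not divide evenly, a bounds-checked variant would be needed; a sketch (the name `memcopy_checked!` is mine, not from the original code):

```julia
# Hypothetical bounds-checked variant of memcopy!: only required when nx
# is not an exact multiple of the block size (in this post it is, so the
# original kernel is safe as written).
function memcopy_checked!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if ix <= length(A)
        @inbounds A[ix] = B[ix]
    end
    return nothing
end
```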

nx = 128*1024^2
nt = 100
A = cuzeros(nx);
B = cuones(nx);
C = cuzeros(nx);
D = cuones(nx);
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
p_min, p_max = priorityRange();
s1 = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING); 
priority(s1)
priority(s2)

for it = 1:nt
    @cuda blocks=nblocks threads=nthreads stream=s1 memcopy!(A, B);
    @cuda blocks=nblocks threads=nthreads stream=s2 memcopy!(C, D);
    CUDAdrv.synchronize()
end

Using a higher priority for the second stream than for the first stream makes the streams overlap:

Using, however, a lower priority for the second stream

s1 = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING); 

makes the second stream start only when the first stream is nearly finished (i.e., when it starts to use fewer GPU resources):

Conclusion

As in CUDA C, stream priorities make it possible to overlap streams even when they saturate the GPU's resources. This does not reduce the total runtime, but it allows, e.g., quickly copying some array boundaries for a halo update while a computation kernel is being executed (my use case for streams).
Streams in CuArrays/CUDAnative/CUDAdrv were thus observed to behave as in CUDA C.
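The halo-update use case mentioned above could be sketched as follows, reusing the priority-stream constructor from this post. The kernel names `compute!` and `copy_boundary!` and the halo sizes are purely illustrative assumptions, not part of the original experiments:

```julia
# Hypothetical sketch: a high-priority stream copies boundary data for a
# halo update while the main computation runs on a low-priority stream.
s_compute = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING)  # low priority
s_halo    = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING)  # high priority

# Launch the (illustrative) compute kernel on the low-priority stream...
@cuda blocks=nblocks threads=nthreads stream=s_compute compute!(A)
# ...and let the boundary copy preempt it on the high-priority stream.
@cuda blocks=1 threads=nthreads stream=s_halo copy_boundary!(halo_buf, A)
CUDAdrv.synchronize()
```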
