Reproducing the CUDA C experiments with Julia
With the stream priority feature, we can now reproduce the CUDA C experiments from above in Julia…
So the code becomes:
using CUDAdrv, CUDAnative, CuArrays
# Extend the CuStream constructor to accept a priority
# (wraps the driver API call cuStreamCreateWithPriority).
function CUDAdrv.CuStream(priority::Integer, flags::CUDAdrv.CUstream_flags=CUDAdrv.STREAM_DEFAULT)
    handle_ref = Ref{CUDAdrv.CuStream_t}()
    CUDAdrv.@apicall(:cuStreamCreateWithPriority, (Ptr{CUDAdrv.CuStream_t}, Cuint, Cint),
                     handle_ref, flags, priority)
    ctx = CuCurrentContext()
    obj = CuStream(handle_ref[], ctx)
    finalizer(CUDAdrv.unsafe_destroy!, obj)
    return obj
end

# Query the stream priority range of the current context as (least, greatest).
# In CUDA, a numerically lower value means a higher priority.
function priorityRange()
    r1_ref = Ref{Cint}()
    r2_ref = Ref{Cint}()
    CUDAdrv.@apicall(:cuCtxGetStreamPriorityRange, (Ptr{Cint}, Ptr{Cint}), r1_ref, r2_ref)
    return (r1_ref[], r2_ref[])
end

# Query the priority of an existing stream.
function priority(s::CuStream)
    prio_ref = Ref{Cint}()
    CUDAdrv.@apicall(:cuStreamGetPriority, (CUDAdrv.CuStream_t, Ptr{Cint}), s, prio_ref)
    return prio_ref[]
end

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return nothing
end

nx = 128*1024^2
nt = 100
A = cuzeros(nx);
B = cuones(nx);
C = cuzeros(nx);
D = cuones(nx);
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
# p_min is the least (lowest) priority, p_max the greatest (highest) priority.
p_min, p_max = priorityRange();
s1 = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING);
priority(s1)
priority(s2)
for it = 1:nt
    @cuda blocks=nblocks threads=nthreads stream=s1 memcopy!(A, B);
    @cuda blocks=nblocks threads=nthreads stream=s2 memcopy!(C, D);
    CUDAdrv.synchronize()
end
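
The launch configuration above works because `nx` is an exact multiple of `nthreads`; the `memcopy!` kernel has no bounds check. The following is a minimal CPU-only sketch of the same index arithmetic, which shows how many threads would fall outside the array for other sizes. The helper names `global_index`, `launch_blocks` and `excess_threads` are hypothetical and not part of CUDAnative or CUDAdrv.

```julia
# CPU-only sketch of the index arithmetic used by memcopy!; no GPU required.
# All function names here are hypothetical helpers for illustration.

# 1-based global index of a thread, computed the same way as in the kernel.
global_index(block, blockdim, thread) = (block - 1) * blockdim + thread

# Number of blocks needed to cover nx elements (integer ceiling division).
launch_blocks(nx, nthreads) = cld(nx, nthreads)

# Number of launched threads whose index would exceed nx. If this is nonzero,
# the kernel needs a guard such as `if ix <= length(A)` around the copy.
excess_threads(nx, nthreads) = launch_blocks(nx, nthreads) * nthreads - nx

nx, nthreads = 128 * 1024^2, 1024
println(launch_blocks(nx, nthreads))    # 131072
println(excess_threads(nx, nthreads))   # 0, so no guard is needed here
println(excess_threads(1000, 256))      # 24, so a guard would be needed
```

For the sizes used in this post the excess is zero, so the unguarded kernel stays within bounds.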
Giving the second stream a higher priority than the first makes the two streams overlap:
However, giving the second stream a lower priority than the first
s1 = CuStream(p_max, CUDAdrv.STREAM_NON_BLOCKING);
s2 = CuStream(p_min, CUDAdrv.STREAM_NON_BLOCKING);
makes the second stream start only when the first stream is nearly finished (i.e. when it starts to use fewer GPU resources):
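
Both experiments depend on which of the two values returned by `priorityRange()` is the higher priority. In CUDA, `cuCtxGetStreamPriorityRange` returns the least priority first and the greatest priority second, and a numerically lower value means a higher priority. The sketch below spells out that convention in plain Julia. The helpers `is_higher_priority` and `describe_range` are hypothetical, and the values `0` and `-1` are example inputs only; the real range should be queried on the device.

```julia
# Plain-Julia sketch of the CUDA priority convention; no GPU required.
# Both helpers are hypothetical names used only for illustration.

# In CUDA, a numerically lower value is a higher priority.
is_higher_priority(a::Integer, b::Integer) = a < b

# Interpret a (least, greatest) pair in the order returned by
# cuCtxGetStreamPriorityRange.
function describe_range(least::Integer, greatest::Integer)
    return (lowest = least, highest = greatest, levels = least - greatest + 1)
end

r = describe_range(0, -1)   # example values, not a measured device range
println(r.levels)                              # 2
println(is_higher_priority(r.highest, r.lowest))  # true
```

With this convention, `p_min` in the code above holds the lowest priority and `p_max` the highest, despite `p_max` being the numerically smaller value.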
Conclusion
As in CUDA C, stream priorities make it possible to overlap streams that would each saturate the GPU resources on their own. This does not reduce the total runtime, but it allows, for example, quickly copying some array boundaries for a halo update while a computation kernel is being executed (my use case for streams).
Streams in CuArrays/CUDAnative/CUDAdrv were observed to behave as they do in CUDA C.
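
As a rough sanity check on the claim that these kernels saturate the GPU, one can estimate how much memory traffic a single `memcopy!` launch generates and compare the resulting throughput against the device's peak bandwidth. The sketch below assumes `Float32` elements (which I believe is the default element type of `cuzeros` and `cuones`) and that each element is read once and written once. The helper names are hypothetical, and the kernel time has to come from your own measurement.

```julia
# Back-of-the-envelope estimate in plain Julia; no GPU required.
# Assumption: Float32 elements, one read and one write per element.
# `bytes_per_launch` and `effective_bandwidth` are hypothetical helpers.

# Bytes moved by one memcopy! launch over nx elements.
bytes_per_launch(nx; T = Float32) = 2 * nx * sizeof(T)

# Effective throughput in bytes per second for a measured kernel time.
effective_bandwidth(nx, seconds; T = Float32) = bytes_per_launch(nx; T = T) / seconds

nx = 128 * 1024^2
println(bytes_per_launch(nx) / 1024^3)   # 1.0 (GiB per launch)
```

Dividing this figure by a measured kernel time gives a throughput that can be compared with the memory bandwidth listed for the GPU.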