Hey,
I’m trying to write a program that launches many kernels, where the launches depend on the results of earlier kernels. Sometimes the result means a kernel should not be launched at all. The kernels can be distributed across many streams.
Currently, I store the results of a kernel in a `CuArray` called `c`, copy it to the host with `a = Array(c)`, and work directly with the value(s) in `a`. Lots of kernels means lots of memory transfers, and these all happen in the default stream. This is a bottleneck.
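For context, this is roughly the pattern, stripped down (the kernel and the launch condition here are just stand-ins for the real ones):

```julia
using CUDA

# stand-in for a real kernel whose result decides the next launch
function dummy_kernel!(c)
    c[1] += 1f0
    return
end

c = CUDA.zeros(Float32, 1)
@cuda dummy_kernel!(c)
a = Array(c)      # synchronous device-to-host copy, on the default stream
if a[1] > 0f0     # host-side decision about the next launch
    @cuda dummy_kernel!(c)
end
```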
It looks like Unified Memory might help because it is accessible from host and device, and you can do async transfers in any stream. This example in C++ makes it look like standard array syntax works on host and device. The interface in CUDA.jl is less transparent to me.
I can create a buffer and write to it like:
```julia
julia> begin
           using CUDA
           # allocate 4 bytes of unified (managed) memory
           unified = CUDA.Mem.alloc(CUDA.Mem.Unified, 4)
           # write a UInt32 through the device-side pointer...
           cuptr = convert(CuPtr{UInt32}, unified)
           CUDA.Mem.set!(cuptr, UInt32(17), UInt32(1))
           # ...and read it back through the host-side pointer
           ptr = convert(Ptr{UInt32}, unified)
           @info "Value: $(unsafe_load(ptr))"
       end
[ Info: Value: 17
```
but `CUDA.Mem.set!` doesn’t seem to be kernel-friendly. If this use case isn’t totally insane, any tips for how to use unified memory in a kernel?
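Something along these lines is what I have in mind: wrap the unified buffer as a `CuArray` for the kernel and as an `Array` for the host, so the kernel writes straight into host-visible memory. I’m guessing with `unsafe_wrap` here, so this may well be the wrong way to go about it:

```julia
using CUDA

unified = CUDA.Mem.alloc(CUDA.Mem.Unified, sizeof(UInt32))
gpu_view  = unsafe_wrap(CuArray, convert(CuPtr{UInt32}, unified), 1)          # device-side view
host_view = unsafe_wrap(Array, convert(Ptr{UInt32}, unified), 1; own=false)   # host-side view

# trivial kernel writing a result into the unified buffer
function write_flag!(flag)
    flag[1] = UInt32(17)
    return
end

@cuda write_flag!(gpu_view)
synchronize()                      # or per-stream synchronization?
@info "Value: $(host_view[1])"     # read on the host, no explicit copy
```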
Thanks