Faster small CUDA memory transfers (UnifiedMem?)


I’m trying to write a program which launches many kernels, where each launch depends on the results of earlier kernels. Sometimes a result means a kernel should not be launched at all. The kernels can be distributed across many streams.

Currently, I store the results of a kernel in a CuArray called c, copy it to the host with a = Array(c), and can work directly with the value(s) in a. Lots of kernels means lots of memory transfers, and these all happen in the default stream. This is a bottleneck.
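For concreteness, the current pattern might look roughly like the sketch below. The kernels check! and followup! and the threshold of 10 are hypothetical stand-ins, not the real code; only the CuArray c and the a = Array(c) copy come from the description above.

```julia
using CUDA

# Hypothetical kernel: writes 1 into c[1] if x passes a made-up threshold, else 0.
function check!(c, x)
    c[1] = x > 10 ? UInt32(1) : UInt32(0)
    return nothing
end

# Hypothetical follow-up kernel that should only run when the flag is set.
function followup!(c)
    c[1] += UInt32(3)
    return nothing
end

c = CUDA.zeros(UInt32, 1)
@cuda check!(c, 42)

# Device-to-host copy of the result; this waits for the kernel to finish.
a = Array(c)

# The host decides whether the next kernel is launched at all.
if a[1] != 0
    @cuda followup!(c)
end
```

Every decision point costs one small device-to-host copy, which is the transfer the question is trying to avoid.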

It looks like Unified Memory might help because it is accessible from host and device, and you can do async transfers in any stream. This example in C++ makes it look like standard array syntax works on host and device. The interface in CUDA.jl is less transparent to me.

I can create a buffer and write to it like:

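julia> begin
       using CUDA
       # 4 bytes of unified (managed) memory, enough for one UInt32
       unified = CUDA.Mem.alloc(CUDA.Mem.Unified, 4)
       # device-side view of the buffer
       cuptr = convert(CuPtr{UInt32}, unified)
       CUDA.Mem.set!(cuptr, UInt32(17), UInt32(1))
       # host-side view of the same buffer
       ptr = convert(Ptr{UInt32}, unified)
       @info "Value: $(unsafe_load(ptr))"
       end
[ Info: Value: 17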

but CUDA.Mem.set! is a host-side call and doesn’t seem to be kernel friendly. If this use case isn’t totally insane, any tips for how to use unified memory in a kernel?


There was an earlier effort which would’ve made the interface much friendlier, but it doesn’t seem to have been moved to CUDA.jl…

unsafe_wrap gives me the rest of the functionality I need, I think. TBD whether the performance is what I’m looking for.

using CUDA

function kernel(cuarray)
    cuarray[1] += 3
    return nothing
end

# 4 bytes of unified memory, viewed from both the device and the host
unified = CUDA.Mem.alloc(CUDA.Mem.Unified, 4)
cuptr = convert(CuPtr{UInt32}, unified)
ptr = convert(Ptr{UInt32}, unified)
cuarray = unsafe_wrap(CuArray{UInt32}, cuptr, 1)
array = unsafe_wrap(Array{UInt32}, ptr, 1)
array[1] = 17
@cuda kernel(cuarray)
# kernel launches are asynchronous; wait before reading on the host
synchronize()
println(map(Int, array))

This prints:

[20]
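Applied to the original problem, the same trick might be used per stream along the lines of the sketch below. It is untested for performance and rests on assumptions: the kernels produce! and consume!, the helper unified_scalar, and the threshold are hypothetical, and it uses the same CUDA.Mem API as above, which other CUDA.jl versions may name differently. On some platforms the host may not touch unified memory while any kernel is running, so the host reads only after synchronizing.

```julia
using CUDA

# Hypothetical producer kernel: sets a flag the host will inspect.
function produce!(flag, x)
    flag[1] = x > 10 ? UInt32(1) : UInt32(0)
    return nothing
end

# Hypothetical consumer kernel: launched only when the flag is nonzero.
function consume!(out)
    out[1] += UInt32(3)
    return nothing
end

# Hypothetical helper: one UInt32 of unified memory, wrapped for device and host.
function unified_scalar()
    buf = CUDA.Mem.alloc(CUDA.Mem.Unified, sizeof(UInt32))
    dev = unsafe_wrap(CuArray{UInt32}, convert(CuPtr{UInt32}, buf), 1)
    host = unsafe_wrap(Array{UInt32}, convert(Ptr{UInt32}, buf), 1)
    return buf, dev, host
end

stream = CuStream()
flagbuf, flag_d, flag_h = unified_scalar()
outbuf, out_d, out_h = unified_scalar()
out_h[1] = 17

@cuda stream=stream produce!(flag_d, 42)
synchronize(stream)   # wait for this stream only, then read the flag on the host

if flag_h[1] != 0     # plain host read, no explicit copy
    @cuda stream=stream consume!(out_d)
    synchronize(stream)
end
println(Int(out_h[1]))
```

Each stream would get its own flag buffer, so a decision in one stream only has to wait on that stream rather than on a copy in the default stream.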