Faster small CUDA memory transfers (UnifiedMem?)

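I think unsafe_wrap gives me the rest of the functionality I need: it lets me view the same unified allocation as a CuArray on the device and as a plain Array on the host. Whether the performance is what I'm looking for is still TBD.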

using CUDA

function kernel(cuarray)
    cuarray[1] += 3
    return nothing
end

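# Allocate 4 bytes of unified (managed) memory, visible to both host and device.
unified = CUDA.Mem.alloc(CUDA.Mem.Unified, sizeof(UInt32))
# Get a device pointer and a host pointer to the same allocation.
cuptr = convert(CuPtr{UInt32}, unified)
ptr = convert(Ptr{UInt32}, unified)
# Wrap each pointer as a 1-element array; no data is copied.
cuarray = unsafe_wrap(CuArray{UInt32}, cuptr, 1)
array = unsafe_wrap(Array{UInt32}, ptr, 1)
array[1] = 17          # write from the host
@cuda kernel(cuarray)  # add 3 on the device
synchronize()          # wait for the kernel before reading on the host
println(map(Int, array))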

This gives us:

[20]
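On the open performance question, a rough sketch like the one below could compare a round trip through unified memory against explicit host-to-device and device-to-host copies. It reuses the CUDA.Mem calls from the code above, which may differ across CUDA.jl versions. The helper names roundtrip_unified! and roundtrip_copy! and the iteration count n are hypothetical, chosen only for illustration. Timings from a simple loop like this are noisy, so treat the numbers as a first impression rather than a benchmark.

```julia
using CUDA

function kernel(cuarray)
    cuarray[1] += 3
    return nothing
end

# Unified-memory setup, using the same calls as the original snippet.
unified = CUDA.Mem.alloc(CUDA.Mem.Unified, sizeof(UInt32))
cuarray = unsafe_wrap(CuArray{UInt32}, convert(CuPtr{UInt32}, unified), 1)
array = unsafe_wrap(Array{UInt32}, convert(Ptr{UInt32}, unified), 1)

# Explicit-copy setup: a separate host array and device array.
host = UInt32[0]
dev = CuArray(host)

# Hypothetical helper: host write, kernel launch, host read via unified memory.
function roundtrip_unified!(array, cuarray)
    array[1] = 17
    @cuda kernel(cuarray)
    synchronize()  # the host must not read until the kernel has finished
    return array[1]
end

# Hypothetical helper: the same round trip with explicit copies each way.
function roundtrip_copy!(host, dev)
    host[1] = 17
    copyto!(dev, host)   # host to device
    @cuda kernel(dev)
    copyto!(host, dev)   # device to host, waits for the kernel
    return host[1]
end

# Warm-up calls so that compilation time is not included in the timings.
roundtrip_unified!(array, cuarray)
roundtrip_copy!(host, dev)

n = 1_000  # assumed iteration count; adjust as needed
t_unified = @elapsed for _ in 1:n
    roundtrip_unified!(array, cuarray)
end
t_copy = @elapsed for _ in 1:n
    roundtrip_copy!(host, dev)
end

println("unified: ", 1e6 * t_unified / n, " µs per round trip")
println("copy:    ", 1e6 * t_copy / n, " µs per round trip")
```

Both helpers include the host-side read in the timed region, since for small transfers the cost of getting the result back to the CPU is the part being compared.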