Your use of register is confusing: do you want pinned memory and an async memcpy, or do you want to register an existing host pointer and map it into device space?
Here’s an example of the latter:
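julia> using CUDAdrv, CuArrays
julia> nx = 2^20;  # 2^20 Float64 values = 8388608 bytes, matching the buffer size shown below
julia> A = zeros(nx);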
julia> A_cpuptr = pointer(A)
Ptr{Float64} @0x00007f360f7ff040
julia> A_buf = Mem.register(Mem.Host, A_cpuptr, sizeof(A), Mem.HOSTREGISTER_DEVICEMAP)
CUDAdrv.Mem.HostBuffer(Ptr{Nothing} @0x00007f360f7ff040, 8388608, CuContext(Ptr{Nothing} @0x000000000255dc70, false, true), true)
julia> A_gpuptr = convert(CuPtr{Float64}, A_buf)
CuPtr{Float64}(0x0000000202c40040)
julia> A_d = unsafe_wrap(CuArray, A_gpuptr, size(A));
# proof that the device mapping works
julia> A[1] = 42
42
julia> A_d[1]
42.0
A_d is now a device array bound to a CPU memory allocation. Accessing that memory from the GPU is pretty expensive though, since each access incurs a read over PCIe.
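If you want to get a feel for the aliasing without a GPU, here is a rough CPU-only analogy using nothing but Base Julia. It wraps an existing allocation in a second Array via unsafe_wrap, so a write through one name is visible through the other, much like A and A_d above. The function name alias_demo is made up for this sketch, and it says nothing about the PCIe cost, which only shows up with an actual device mapping.

```julia
# CPU-only analogy: two arrays backed by the same allocation (no CUDA needed).
# `alias_demo` is a hypothetical helper name used only for this illustration.
function alias_demo()
    A = zeros(4)
    # GC.@preserve keeps A alive while its raw pointer is in use.
    result = GC.@preserve A begin
        # Wrap the existing memory without copying it.
        B = unsafe_wrap(Array, pointer(A), size(A))
        A[1] = 42                            # write through the original array
        (B[1], pointer(A) == pointer(B))     # read through the alias, compare addresses
    end
    return result
end

value, same_address = alias_demo()
```

Here value is read through the wrapper after writing through the original, and same_address checks that both arrays point at the same memory.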