Reading or writing doesn’t make any difference. If you allocate a device-mapped host array, you can perform ordinary reads and writes on it from the CPU, and the GPU can read and write that same memory too. But again, since that’s host memory, the GPU accesses it over the PCIe bus, so those memory operations will be slow. If you want to read or write device memory without waiting for the kernel to finish, issue the access on a separate stream to break the ordering with the kernel. In that case there are no guarantees about the validity of the memory contents, because the access may happen before, during, or after the kernel’s own writes.
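To make the two options concrete, here is a minimal CUDA C sketch (not CUDA.jl, and not the only way to do this). The `increment` kernel, the buffer names, and the sizes are hypothetical and only there for illustration. Error checking is omitted for brevity. Part 1 uses a device-mapped host allocation, and part 2 copies from device memory on a separate stream while a kernel may still be running.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(int);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Request support for mapped pinned memory before any other CUDA call.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Part 1: device-mapped host memory, visible to both CPU and GPU.
    int *host_ptr = nullptr;   // CPU view of the allocation
    int *dev_alias = nullptr;  // GPU view of the same allocation
    cudaHostAlloc((void **)&host_ptr, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev_alias, host_ptr, 0);

    for (int i = 0; i < n; ++i) host_ptr[i] = i;  // ordinary CPU writes

    // Every GPU access here goes to host memory over the bus, so it is slow.
    increment<<<blocks, threads>>>(dev_alias, n);
    cudaDeviceSynchronize();  // after this, the CPU sees the kernel's writes
    printf("mapped host memory, first element: %d\n", host_ptr[0]);

    // Part 2: read device memory on a separate stream while a kernel runs.
    int *dev_buf = nullptr;
    int *snapshot = nullptr;  // pinned host buffer so the copy can be async
    cudaMalloc((void **)&dev_buf, bytes);
    cudaMallocHost((void **)&snapshot, bytes);
    cudaMemset(dev_buf, 0, bytes);
    cudaDeviceSynchronize();  // make sure the memset has finished

    cudaStream_t kernel_stream, copy_stream;
    cudaStreamCreateWithFlags(&kernel_stream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking);

    increment<<<blocks, threads, 0, kernel_stream>>>(dev_buf, n);

    // Not ordered with respect to the kernel: the copy may observe the
    // buffer before, during, or after the kernel's writes.
    cudaMemcpyAsync(snapshot, dev_buf, bytes, cudaMemcpyDeviceToHost,
                    copy_stream);
    cudaStreamSynchronize(copy_stream);
    printf("unordered snapshot, first element: %d\n", snapshot[0]);

    cudaStreamSynchronize(kernel_stream);

    cudaStreamDestroy(kernel_stream);
    cudaStreamDestroy(copy_stream);
    cudaFreeHost(snapshot);
    cudaFree(dev_buf);
    cudaFreeHost(host_ptr);
    return 0;
}
```

The value printed in part 2 is deliberately unspecified: without an event or a synchronization between the two streams, nothing orders the copy relative to the kernel.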
These questions are not specific to CUDA.jl and apply equally to CUDA C, so you can also Google for them. For example, see the Stack Overflow question “Accessing cuda device memory when the cuda kernel is running”.