I am currently using this to pass CuArrays to PyTorch tensors:

using PyCall
tc = pyimport("torch")
using CuArrays
x = cu(randn(1024, 1024))
@time tc.from_numpy(Array(x)).cuda()
# 0.025017 seconds (62 allocations: 20.00 MiB)
...

Since Array(x) moves x from the GPU to the CPU, and .cuda() moves it back from the CPU to the GPU, I was wondering if there is a way to stay on the GPU and convert a CuArray to a PyTorch tensor without this round trip.

Directly applying tc.from_numpy() to a CuArray does not work and segfaults.

Let me know if you have any ideas about how to do this.

One solution is to create a PyTorch tensor, wrap it as a CuArray with Base.unsafe_wrap, and then manipulate it with CuArrays.jl.
No copy at all.

# Requires: using CuArrays, PyCall, Random (for randn!)
x_tc = tc.empty((1024, 1024), dtype=tc.float64).cuda()
# The dtype above (tc.float64) must match the CuPtr element type (Float64).
x = Base.unsafe_wrap(CuArray, CuPtr{Float64}(x_tc.data_ptr()), (1024, 1024))
randn!(x)
# randn!(x) modifies the PyTorch tensor x_tc, because x_tc and the CuArray x share the same GPU memory.

I haven’t actually tried this code, so there might be minor mistakes…

Alternatively, though I don't know whether it is really possible, a CuArray might be convertible directly to a PyTorch tensor through torch.utils.dlpack.from_dlpack.
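For the DLPack route, here is a minimal sketch of what it could look like, assuming the DLPack.jl package and its `share`/`wrap` functions; I haven't tried this with this exact CuArrays/PyCall setup, so treat every name here as an assumption rather than a verified API:

```julia
# Hypothetical sketch using DLPack.jl; function names and signatures are
# assumptions, not verified against this setup.
using PyCall, CuArrays, DLPack

torch = pyimport("torch")
dlpack = pyimport("torch.utils.dlpack")

x = cu(randn(Float32, 1024, 1024))

# Julia -> PyTorch without leaving the GPU: export x via the DLPack
# protocol and let PyTorch wrap the same memory.
x_tc = DLPack.share(x, dlpack.from_dlpack)

# PyTorch -> Julia: wrap an existing tensor's memory as a Julia array.
y = DLPack.wrap(x_tc, dlpack.to_dlpack)
```

If this works, both directions share memory rather than copying, so mutating one side would be visible on the other, just like the unsafe_wrap approach above.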