Help converting a Pytorch tensor to Julia CuArray

I am trying to use PythonCall/juliacall to convert a PyTorch tensor to a Julia CuArray directly from its pointer (I want to avoid unnecessary copying if possible). Does anyone know what I am doing wrong? The output doesn’t look correct, since the array should consist of 1s and 0s only (unless I am just misreading it).

MWE Colab Link

Code

import numpy as np
import torch

sz = (100, 100)
arr = np.random.choice([0, 1], size=sz)

# Step 1: Create a PyTorch tensor and transfer it to GPU
tensor = torch.tensor(arr, dtype=torch.float32).cuda()
print(tensor)

# Step 2: Convert to CuArray using PythonCall
cu_arr = jl.unsafe_wrap(jl.CuArray, jl.PythonCall.getptr(tensor), sz)
print(cu_arr)

Output

tensor([[0., 1., 0.,  ..., 1., 0., 0.],
        [0., 1., 1.,  ..., 1., 1., 1.],
        [0., 1., 1.,  ..., 0., 1., 0.],
        ...,
        [0., 0., 0.,  ..., 1., 1., 0.],
        [0., 1., 1.,  ..., 0., 1., 1.],
        [1., 1., 1.,  ..., 0., 0., 0.]], device='cuda:0')
PythonCall.C.PyObject[PythonCall.C.PyObject(2, Ptr{Nothing} @0x00005bbbbba26dd0) PythonCall.C.PyObject(0, Ptr{Nothing} @0x00005bbbbba26dd0) PythonCall.C.PyObject(133973972021008, Ptr{Nothing} @0x000079daf8e4c1b0) PythonCall.C.PyObject(133978876351936, Ptr{Nothing} @0x000079d93f02f5b0) PythonCall.C.PyObject(19, Ptr{Nothing} @0x7c25d4026712e6d2)...

PythonCall.getptr is an internal function and does not get the pointer you want.

You can get the CUDA pointer from the __cuda_array_interface__.
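For example, something like this should pull the device pointer out (a sketch: CUDA tensors expose __cuda_array_interface__ as a dict whose 'data' entry is a (pointer, read_only) pair):

# Sketch: read the device pointer from the CUDA Array Interface
iface = tensor.__cuda_array_interface__
ptr = iface["data"][0]    # integer device pointer
shape = iface["shape"]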

Oh okay, so if I do this to get the underlying pointer, is there a way to convert that to a Julia pointer using juliacall?

import numpy as np
import torch

sz = (100, 100)
arr = np.random.choice([0, 1], size=sz)

# Step 1: Create a PyTorch tensor and transfer it to GPU
tensor = torch.tensor(arr, dtype=torch.float32).cuda()

# Step 2: Get pointer of the tensor
ptr = tensor.data_ptr()
print("pointer: ", ptr)
pointer:  138064357793280

CUDA.jl provides a CuPtr{T}, but I am having trouble converting this integer to a pointer that Julia understands.

CuPtr{Float32}(pyconvert(UInt, ptr))

perhaps?
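
For reference, plugged into unsafe_wrap from the Python side, that suggestion looks roughly like this (a sketch assuming jl has already run using CUDA, PythonCall; note also that torch tensors are row-major while Julia arrays are column-major, so the wrapped array is effectively the transpose of the tensor):

# Sketch: interpolate the address as a literal for now; a cleaner way to
# pass Python variables into seval comes up later in the thread
cu_ptr = jl.seval(f"CuPtr{{Float32}}(UInt({tensor.data_ptr()}))")
# Wrap the device memory without copying. The dims match tensor.size(),
# but Julia reads the buffer column-major (transposed relative to torch)
cu_arr = jl.unsafe_wrap(jl.CuArray, cu_ptr, tuple(tensor.size()))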

Oh, this is almost there. I just need to figure out how to pass a Python variable into a jl.seval("...") call. When I hardcode the pointer integer, this works:

import numpy as np
import torch

sz = (100, 100)
arr = np.random.choice([0, 1], size=sz)

# Step 1: Create a PyTorch tensor and transfer it to GPU
tensor = torch.tensor(arr, dtype=torch.float32).cuda()
print("Pytorch Tensor : ", tensor)

# Step 2: Get pointer of the tensor
ptr = tensor.data_ptr()
print("pointer: ", ptr)

# DOESN"T WORK
# cu_ptr = jl.seval("""
# CuPtr{Float32}(pyconvert(UInt, ptr))
# """)

# Convert to a Julia CuPtr (I don't know how to pass in the variable ptr)
cu_ptr = jl.seval("""
CuPtr{Float32}(pyconvert(UInt, 138064357793280))
""")
print("julia pointer: ", cu_ptr)

# Convert to CUDA array
cu_arr = jl.unsafe_wrap(jl.CuArray, cu_ptr, sz)
cu_arr

Output

Pytorch Tensor :  tensor([[0., 0., 0.,  ..., 0., 1., 1.],
        [0., 1., 0.,  ..., 1., 1., 1.],
        [0., 1., 1.,  ..., 0., 1., 0.],
        ...,
        [0., 1., 0.,  ..., 0., 1., 1.],
        [1., 0., 1.,  ..., 0., 1., 1.],
        [1., 1., 0.,  ..., 1., 0., 0.]], device='cuda:0')
pointer:  138064357752832
julia pointer:  CuPtr{Float32}(0x00007d919d009e00)
100×100 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.0  1.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  1.0  …  1.0  1.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0  1.0  0.0  1.0  1.0  0.0  0.0     1.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0
 0.0  0.0  1.0  1.0  0.0  1.0  1.0  0.0  1.0  0.0     1.0  0.0  1.0  1.0  0.0  1.0  0.0  0.0  1.0
 0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  1.0  0.0     0.0  0.0  1.0  0.0  1.0  1.0  0.0  1.0  1.0
 1.0  1.0  0.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  0.0  1.0  …  0.0  0.0  1.0  1.0  1.0  0.0  1.0  0.0  0.0
 1.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0     1.0  0.0  1.0  1.0  1.0  0.0  1.0  0.0  1.0
 1.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  0.0     1.0  0.0  1.0  0.0  0.0  1.0  0.0  1.0  0.0
 0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0     0.0  1.0  1.0  0.0  1.0  0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0  1.0  1.0  0.0  1.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0
 ⋮                        ⋮                        ⋱                      ⋮                   
 0.0  0.0  1.0  1.0  1.0  1.0  1.0  0.0  1.0  1.0     0.0  0.0  1.0  0.0  0.0  1.0  0.0  1.0  0.0
 0.0  1.0  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  0.0  0.0  1.0  1.0
 0.0  0.0  1.0  0.0  1.0  0.0  1.0  0.0  1.0  1.0     1.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0
 0.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0     1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  …  0.0  0.0  1.0  1.0  0.0  1.0  1.0  0.0  1.0
 0.0  1.0  0.0  1.0  0.0  0.0  0.0  1.0  1.0  0.0     0.0  0.0  1.0  1.0  0.0  1.0  0.0  1.0  1.0
 1.0  1.0  1.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0     1.0  0.0  1.0  1.0  0.0  1.0  0.0  1.0  0.0
 0.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0     1.0  0.0  1.0  0.0  1.0  0.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  1.0

If you don’t mind the extra dependency, DLPack.jl (https://github.com/pabloferz/DLPack.jl) makes this trivial.

Is that for CUDA only? I wrote the kernels for a package using KernelAbstractions.jl, so I want to build this out in a vendor-neutral way. I am just using PyTorch and CUDA for testing the implementation in Python, but it seems like the pointer approach might be more flexible than DLPack for this?

Specifically this package, btw: https://github.com/Dale-Black/DistanceTransforms.jl/blob/master/src/transform.jl

It is, but you could extend https://github.com/pabloferz/DLPack.jl/blob/main/src/cuda.jl to work for other GPU array types. See https://github.com/pabloferz/DLPack.jl/blob/main/src/DLPack.jl#L36-L39

You can create an anonymous function to pass ptr into:

cu_ptr = jl.seval("""
ptr -> CuPtr{Float32}(pyconvert(UInt, ptr))
""")(ptr)

(But the above suggestions to use DLPack are indeed simpler.)


Brilliant, thank you so much. I will look into DLPack more too. If it’s easy to extend to various GPU vendors, that is incredible.

Hi everyone,

I wanted to revisit this topic and see if anyone has any insights or recommendations on efficiently converting a PyTorch tensor to a Julia CuArray while keeping the data on the GPU throughout the process.

Regarding the pointer approach, I tried the following:

def cu_transform(tensor):
    sz = tuple(tensor.size())
    ptr = tensor.data_ptr()
    # Convert the integer address to a CuPtr{Float32} on the Julia side
    cu_ptr = jl.seval("""
    ptr -> CuPtr{Float32}(pyconvert(UInt, ptr))
    """)(ptr)
    # Wrap the device memory without copying
    cu_arr = jl.unsafe_wrap(jl.CuArray, cu_ptr, sz)
    # transform/boolean_indicator come from DistanceTransforms.jl
    return jl.transform(jl.boolean_indicator(cu_arr))

This approach works fine with torch.cuda tensors outside of training loops. However, when running it inside a training loop, I encountered a CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS). I suspect there might be an issue with the memory management or synchronization between PyTorch and Julia within the training loop context.
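
A couple of defensive measures may help here (assumptions on my part, not verified in this thread): make the tensor contiguous so the raw pointer matches the logical layout, synchronize PyTorch's pending CUDA work before Julia touches the buffer, and keep the tensor referenced for as long as the wrapped CuArray is in use so the caching allocator cannot free or reuse the memory. A sketch along the lines of the earlier cu_transform:

import torch

def cu_transform_sync(tensor):
    # Ensure the pointer describes a dense, contiguous buffer
    tensor = tensor.contiguous()
    # Finish pending PyTorch kernels before Julia reads the memory
    torch.cuda.synchronize()
    sz = tuple(tensor.size())
    cu_ptr = jl.seval("ptr -> CuPtr{Float32}(pyconvert(UInt, ptr))")(tensor.data_ptr())
    cu_arr = jl.unsafe_wrap(jl.CuArray, cu_ptr, sz)
    result = jl.transform(jl.boolean_indicator(cu_arr))
    # tensor must stay alive until Julia is done with cu_arr; dropping
    # the last reference mid-use is one way to hit error 700
    return result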

As an alternative, I explored using DLPack.jl to handle the tensor conversion. According to the DLPack.jl README.md, I tried the following:

pyv = torch.arange(1, 5).reshape(2, 2)
v = DLPack.from_dlpack(pyv)

Unfortunately, I encountered an error:

JuliaError: ArgumentError: The input does not follow the DLPack specification
Stacktrace:
 [1] from_dlpack(o::PyIterable{Any})
   @ DLPack ~/.julia/packages/DLPack/1mZGE/src/DLPack.jl:194

Other workarounds I attempted also led to different errors.

My ultimate goal is to create a wrapper around a CUDA.jl-based function using PythonCall.jl (juliacall) for a Python library. The function works on GPU with a simple torch tensor outside of training loops, but I’m running into issues when integrating it into a training loop. It’s crucial for me to avoid converting back and forth between GPU and CPU, as it would negate the efficiency gains I’m aiming for.

If anyone has successfully used the pointer approach or DLPack.jl within a PyTorch training loop while keeping the data on the GPU throughout the process, I would greatly appreciate your guidance. Additionally, if there are any recommendations on how to seamlessly utilize multiple GPU platforms within a PyTorch training loop, similar to how diffeqpy does it, that would be fantastic.

Thank you in advance for any help or code examples you can provide!

Here is the working DLPack.jl version, thanks to @pabloferz. The trick is passing the tensor through juliacall.convert(PythonCall.Py, tensor) so it reaches Julia as a raw Py object instead of being auto-converted to a wrapper type like the PyIterable{Any} in the error above:

def transform_cuda(tensor):
    tensor_jl = DLPack.from_dlpack(juliacall.convert(PythonCall.Py, tensor))
    result_jl = DistanceTransforms.transform(DistanceTransforms.boolean_indicator(tensor_jl))
    return DLPack.share(result_jl, torch.from_dlpack)
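
For completeness, a sketch of the setup the snippet above presumably assumes (the exact bootstrapping was not shown in the thread):

import torch
import juliacall
from juliacall import Main as jl

# Load the Julia packages once and grab module handles; these names
# match the snippet above, but how they were bound is an assumption
jl.seval("import DLPack, DistanceTransforms, PythonCall")
DLPack = jl.DLPack
DistanceTransforms = jl.DistanceTransforms
PythonCall = jl.PythonCall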

I just released DLPack.jl version 0.3.0. With it, you should be able to just do:

tensor_jl = DLPack.from_dlpack(tensor)

from juliacall.
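
With that, the wrapper above should reduce to something like this (a sketch under the same assumed setup; only the conversion line changes):

def transform_cuda(tensor):
    # DLPack.jl >= 0.3.0 accepts the torch tensor directly, so the
    # juliacall.convert(PythonCall.Py, ...) wrapper is no longer needed
    tensor_jl = DLPack.from_dlpack(tensor)
    result_jl = DistanceTransforms.transform(DistanceTransforms.boolean_indicator(tensor_jl))
    return DLPack.share(result_jl, torch.from_dlpack)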


Not sure why, but I got a little speed boost from that as well!
