Passing an Arrow Table from Python to Julia

Hi,

I have a Python job that calls Julia for some computation on my datasets. Right now, passing data back and forth between Python and Julia is a bottleneck: my current process is to save the pandas DataFrame to disk as a Feather file and then load that file from Julia.
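Concretely, the disk-based round trip looks roughly like this (a minimal sketch; "data.feather" is just an example path, and the Julia side is shown as comments):

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["bob", "sam"]})
df.to_feather("data.feather")  # Python side: write the frame to disk as a Feather (Arrow IPC) file

# Julia side:
#   using Arrow
#   tbl = Arrow.Table("data.feather")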

I know PyJulia can pass numpy arrays with zero copying, and I’m trying to figure out whether the same thing is possible with an Arrow Table. However, the table gets passed as a PyObject, and the Arrow library in Julia doesn’t seem to be able to convert it, even on the latest version.
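The numpy case that already works looks roughly like this (a minimal sketch, assuming PyJulia is installed and can find a Julia runtime):

import numpy as np
import julia
jl = julia.Julia(compiled_modules=False)
from julia import Base

x = np.arange(3)
Base.sum(x)  # returns 3; the numpy array is handed straight to Julia, no file in between

By contrast, here is what happens with the Arrow table: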

import pandas as pd
import pyarrow

df = pd.DataFrame({"id": [1, 2],
                   "name": ["bob", "sam"]})
table = pyarrow.Table.from_pandas(df)


import julia
jl = julia.Julia(compiled_modules=False)
from julia import Arrow
from julia import Base
Base.length([1, 2, 3])
>> 3
Base.length(table)
>> 2
Base.typeof(table)
>> <PyCall.jlwrap PyObject>
Arrow.Table(table)
RuntimeError: <PyCall.jlwrap (in a Julia function called from Python)
JULIA: MethodError: no method matching Arrow.Table(::PyObject)

Is there a better way to pass datasets between the two?

Arrow.Table needs to be given either the path to an arrow-formatted file as a string (like Arrow.Table(file)) or a byte vector (Vector{UInt8}). You can see an example, at least from the Julia side, of “round tripping” arrow data with pyarrow here: arrow-julia/pyarrow_roundtrip.jl at main · apache/arrow-julia · GitHub. So in your case, I’d see if there’s a way to get access to the raw arrow-formatted bytes from your table object.
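For example, on the Python side something along these lines should get the raw IPC bytes out (a sketch; table is the pyarrow Table from your first post):

import pyarrow as pa

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)              # serialize the whole table in Arrow IPC stream format
raw_bytes = sink.getvalue().to_pybytes()   # raw arrow-formatted bytes, ready to hand to Arrow.Table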


Thanks! I was able to get a PyJulia version working with a few small modifications:

df = pd.DataFrame({"id": [1, 2],
                  "name": ["bob", "sam"]})
batch = pyarrow.record_batch(df)
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()
buf = sink.getvalue()
jbytes = buf.to_pybytes()
tt = Arrow.Table(bytearray(jbytes))