Pyarrow conversion with PythonCall

It would appear to me that PythonCall.jl does not convert pyarrow tables to DataFrames. I get the error:

ERROR: cannot convert this Python 'Table' to a Julia 'PyTable'

Since PyTable(x) wraps Tables.jl-compatible tables, I was hoping that it would “just work”.

Is there a quick way to get this working? My goal is no-copy conversion from the python object to the Julia object.
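
For reference, a minimal reproduction of the failure (the toy table here is only for illustration):

using PythonCall
pa = pyimport("pyarrow")

tbl = pa.table(pydict(Dict("a" => pylist([1, 2, 3]))))
PyTable(tbl)  # ERROR: cannot convert this Python 'Table' to a Julia 'PyTable'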

The latest “working” example is this: Re-use Awkward / pyarrow IPC for Julia Arrow.jl · GitHub


Yikes. That looks a bit too fragile for me. The problem exists for polars dataframes as well, it seems. Is that really the only way?

Is pandas conversion “no copy”? I just thought that using pyarrow or polars would be better since they have more focus on performance.

Think about this at a higher level: in memory, pandas, polars, or pyarrow tables are just a bunch of bytes plus schema/pointers. So to do “no-copy”, you either have to make Julia able to reinterpret those data bytes directly, or you have to manipulate the Python in-memory structure through Python.

You want option #1, but understanding an implementation-dependent blob is hard, which is why Arrow exists: the “batch” is intended to be used as an IPC blob. But the IPC format assumes the bytes are contiguous (otherwise you can’t easily pass them around). When you open a feather file with pyarrow or polars, it’s mmap-ed and the in-memory representation involves Rust/C++/Python structures; the “write to buffer” step is a way of saying “now please give me something fully up to spec with Arrow IPC, I need to use it elsewhere”.
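
To make that concrete, here is a minimal sketch of the round trip, assuming a pyarrow Table tbl is already in hand:

using Arrow, PythonCall
pa = pyimport("pyarrow")

buf = IOBuffer()
pywith(pa.ipc.new_stream(buf, tbl.schema)) do writer
    writer.write_table(tbl)      # ask pyarrow to emit spec-compliant IPC bytes
end
jltbl = Arrow.Table(take!(buf))  # Arrow.jl wraps those bytes without copying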

That I understand. But the gist isn’t exactly written in a way that sounds completely certain of its own correctness, lol. If this functionality were to go into PythonCall, do you think that is how it would be implemented? How safe is it to use? I want no copies, but I also want no (silently) lost data!

🤷‍♂️

All of the APIs I used are public, for both pyarrow and Arrow.jl.

You mean you want someone to put the hacky code behind a function call so you don’t know what it’s actually doing? PythonCall itself is plenty hacky, and there have been a few “reliability” issues.

If you have specific concerns other than “this looks hacky because it’s just some random person’s gist”, I’m happy to see if I can address them.

Not that! Terms like “practically viable” and “This in principle allows one…” don’t make it seem like the final answer, that’s all.

I will try it out. I assume that all of the Awkward stuff is not relevant, since I already have a pyarrow table? And when I call to_batches(), I get a list rather than a single batch, so I guess I need to write them all to the sink and then do take!?

Honestly, yes. The IPC bytes written out by pyarrow are completely safe because they follow the spec. It would be far more unreliable to rely on Python-internal objects and try to re-use the pointers to each Arrow buffer; that’s probably impossible.


Yeah, indeed: the gist deals with something that only has one batch.


Very, very cool. The only modification I needed was to do

pywith(pa.ipc.new_stream(jl_sink, first(pa_batches).schema)) do writer
    writer.write_batch.(pa_batches)
end;

Works fairly fast on a 5M x 12 dataframe.

For reference (mostly my future self), here is a function that contains the code:

# Requires DataFrames, Arrow, PythonCall, and pyarrow in the Python environment
using DataFrames, Arrow, PythonCall

const pa = pyimport("pyarrow")

function pyarrow_to_jldf(pyarr)
    # a pyarrow Table is stored as a list of record batches
    pa_batches = pyarr.to_batches()

    # write every batch into an in-memory Arrow IPC stream
    jl_sink = IOBuffer()
    pywith(pa.ipc.new_stream(jl_sink, first(pa_batches).schema)) do writer
        writer.write_batch.(pa_batches)
    end

    # Arrow.Table wraps the IPC bytes; DataFrame then materializes the columns
    df = DataFrame(Arrow.Table(take!(jl_sink)))
    close(jl_sink)

    return df
end
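
Hypothetical usage, building a small table on the Python side first:

pytbl = pa.table(pydict(Dict("x" => pylist(1:5), "y" => pylist(rand(5)))))
df = pyarrow_to_jldf(pytbl)  # 5×2 DataFrame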

I have had issues with PythonCall when packaged into a Julia package; maybe try PyCall in that case.