Pyarrow conversion with PythonCall

It would appear to me that PythonCall.jl does not convert pyarrow tables to DataFrames. I get the error:

ERROR: cannot convert this Python 'Table' to a Julia 'PyTable'

Since PyTable(x) wraps Tables.jl-compatible tables, I was hoping that it would “just work”.

Is there a quick way to get this working? My goal is no-copy conversion from the python object to the Julia object.
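
For reference, a minimal reproduction of the failure (the toy table here is only for illustration):

using PythonCall
pa = pyimport("pyarrow")

tbl = pa.table(pydict(Dict("a" => pylist([1, 2, 3]))))
PyTable(tbl)  # ERROR: cannot convert this Python 'Table' to a Julia 'PyTable'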

The latest “working” example is this: Re-use Awkward / pyarrow IPC for Julia Arrow.jl · GitHub


Yikes. That looks a bit too fragile for me. The problem exists for polars dataframes as well, it seems. Is that really the only way?

Is pandas conversion “no copy”? I just thought that using pyarrow or polars would be better since they have more focus on performance.

Think about this at a higher level: in memory, pandas, polars, or pyarrow tables are just a bunch of bytes plus schema/pointers. So to do “no-copy”, you either have to make Julia able to reinterpret those data bytes directly, or you have to manipulate the Python in-memory structure through Python.

You want option #1, but understanding an implementation-dependent blob is hard, which is why Arrow exists: the “batch” is intended to be used as an IPC blob. But the IPC format assumes the bytes are contiguous (otherwise you can’t easily pass them around). When you open a feather file with pyarrow or polars, it’s mmap-ed and the in-memory representation involves Rust/C++/Python structures; the “write to buffer” step is a way of saying “now please give me something fully up to spec with Arrow IPC, I need to use it elsewhere”.
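
To make that concrete, here is a minimal sketch of the round trip, assuming a pyarrow Table tbl is already in hand:

using Arrow, PythonCall
pa = pyimport("pyarrow")

buf = IOBuffer()
pywith(pa.ipc.new_stream(buf, tbl.schema)) do writer
    writer.write_table(tbl)      # ask pyarrow to emit spec-compliant IPC bytes
end
jltbl = Arrow.Table(take!(buf))  # Arrow.jl wraps those bytes without copying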

That I understand. But the gist isn’t exactly written in a way that sounds completely certain of its own correctness, lol. If this functionality were to go into PythonCall, do you think that is how it would be implemented? How safe is it to use? I want no copies, but I also want no (silently) lost data!

🤷‍♂️

All of the APIs I used are public, for both pyarrow and Arrow.jl.

You mean you want someone to put the hacky code behind a function call so you don’t know what it’s actually doing? PythonCall itself is plenty hacky, and there have been a few “reliability” issues.

If you have specific concerns other than “this looks hacky because it’s just some random person’s gist”, I’m happy to see if I can address them.

Not that! Terms like “practically viable” and “This in principle allows one…” don’t make it seem like the final answer, that’s all.

I will try it out. I assume that all of the Awkward stuff is not relevant, since I already have a pyarrow table? And when I call to_batches(), I get a list rather than a single batch, so I guess I need to write them all to the sink and then do take!?

Honestly, yes. The IPC bytes written out by pyarrow are completely safe because they follow the spec. It would be far more unreliable to rely on Python-internal objects and try to re-use the pointers to each Arrow buffer; that’s probably impossible.


Yeah, indeed: the gist deals with something that only has one batch.


Very, very cool. The only modification I needed was to do

pywith(pa.ipc.new_stream(jl_sink, first(pa_batches).schema)) do writer
    writer.write_batch.(pa_batches)
end;

Works fairly fast on a 5M x 12 dataframe.

For reference (mostly my future self), here is a function that contains the code:

# Requires DataFrames, Arrow, PythonCall, and pyarrow in the Python environment
using DataFrames, Arrow, PythonCall

const pa = pyimport("pyarrow")

function pyarrow_to_jldf(pyarr)
    # a pyarrow Table is stored as a list of record batches
    pa_batches = pyarr.to_batches()

    # write every batch into an in-memory Arrow IPC stream
    jl_sink = IOBuffer()
    pywith(pa.ipc.new_stream(jl_sink, first(pa_batches).schema)) do writer
        writer.write_batch.(pa_batches)
    end

    # Arrow.Table wraps the IPC bytes; DataFrame then materializes the columns
    df = DataFrame(Arrow.Table(take!(jl_sink)))
    close(jl_sink)

    return df
end
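
Hypothetical usage, building a small table on the Python side first:

pytbl = pa.table(pydict(Dict("x" => pylist(1:5), "y" => pylist(rand(5)))))
df = pyarrow_to_jldf(pytbl)  # 5×2 DataFrame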

I have had issues with PythonCall when packaged into a Julia package; maybe try PyCall in that case.