Think about this at a higher level. In memory, pandas, polars, or pyarrow tables are just a bunch of bytes plus schema and pointers. So to do “non-copy”, you either have to make Julia able to reinterpret the data bytes, or you have to manipulate the Python in-memory structures through Python.
You want option #1, but understanding an implementation-dependent blob is hard, which is why Arrow exists: the “batch” is intended to be used as an IPC blob. But IPC assumes the bytes are “contiguous” (otherwise you can’t easily pass them around). When you open a feather file with pyarrow or polars, it’s mmap-ed, and what lives in memory is Rust/C++/Python stuff. The “write to buffer” step is a way of saying “now please give me something fully up to spec with Arrow IPC, I need to use it elsewhere”.
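To make the “bytes plus schema” idea concrete, here is a minimal standard-library sketch (not Arrow itself, and not the code from the gist): a contiguous blob of bytes is reinterpreted as 64-bit integers without copying. The names `raw` and `view` and the sample values are made up for illustration.

```python
import struct

# A contiguous blob: three signed 64-bit integers in native byte order.
raw = bytearray(struct.pack("=3q", 10, 20, 30))

# The "schema" here is just the knowledge that the bytes are int64 values.
# cast() reinterprets the same memory; it does not copy it.
view = memoryview(raw).cast("q")
values_before = view.tolist()

# Mutating the underlying bytes is visible through the view,
# which shows that both names refer to the same memory.
struct.pack_into("=q", raw, 8, 99)
values_after = view.tolist()

print(values_before, values_after)
```

This only works because the bytes are contiguous and the layout is agreed on by both sides, which is the role the Arrow IPC spec plays between pyarrow and Arrow.jl.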
That I understand. But the gist isn’t exactly written in a way that is completely certain of its correctness lol. If this functionality were to go into PythonCall, do you think that would be the way it would be implemented? How safe is it to use? I want no copy but I also want no (silently) lost data!
All of the APIs I used are public, for both pyarrow and Arrow.jl.
You mean you want someone to put hacky code behind a function call so you don’t know what it’s actually doing? PythonCall itself is plenty hacky and there have been a few “reliability” issues.
If you have specific concerns other than “this looks hacky because it’s just some random person’s gist”, I’m happy to see if I can address them.
Not that! Terms like “practically viable” and “This in principle allows one…” don’t make it seem like the final answer, that’s all.
I will try it out. I assume that all of the awkward stuff is not relevant since I already have a pyarrow table? And when I call to_batches(), I get a list rather than a single batch, so I guess I need to write them all to the sink and then do take!?
Honestly, yes. The IPC bytes written out by pyarrow are completely safe because the format is spec-ed. It would be far more unreliable to rely on Python’s internal objects and try to reuse the pointer to each Arrow page; that is probably impossible.