Challenges with Arrow and Parquet in a (reasonably substantial) Julia Project

chris-b1 · April 18, 2024, 3:58pm

I was curious if you could expand on this a bit. In my experience Arrow.jl is well supported and performant.

I haven’t used it much but there is also Parquet2.jl which is actively developed.

rdavis120 · April 18, 2024, 4:21pm

Arrow.jl does not support the Arrow C data interface, so it’s not possible to zero-copy an Arrow object between languages. It also doesn’t support ADBC or Arrow Flight; Flight support in particular would let users host a Flight server with Julia as the backend.

For all IO, I have found DuckDB.jl and DuckDB’s support across other languages to be helpful, although its Julia client API is not officially supported.

Instead of JSON + StructTypes, I have found Serde.jl to be helpful.

rongcuid · April 18, 2024, 4:21pm

Arrow.jl doesn’t support streaming correctly. It also reads the entire arrow data into memory, despite the file being uncompressed. I had some other issues which I don’t remember that forced me to first switch to Parquet because at least Parquet was supported by Julia, then even Parquet.jl broke. Then I switched to DuckDB.jl, and alas, duckdb 0.9.1 segfaulted only on Julia. By that time I just switched to Python and Rust fully.

jar1 · April 18, 2024, 6:43pm

Are you talking about Parquet2.jl and Arrow.jl? What problems did you have?

Edit: This is discussed above already.

rongcuid · April 18, 2024, 6:46pm

Arrow.jl and Parquet.jl.

Arrow.jl has:

No memory-mapped reads
No real streaming (the docs say it does, but it actually reads the whole table into memory)

Parquet.jl has:

Can’t read my data, period.

I didn’t know about Parquet2.jl before I switched languages. I mean, Parquet.jl is supposed to be maintained by JuliaIO, so I never looked elsewhere.

adienes · April 18, 2024, 6:47pm

well, Parquet2.jl cannot read my parquet file

through no fault of @ExpandingMan — I appreciate his prompt attention when I have submitted issues — I guess it is due to some known issue in CodecLz4.jl

but the fact remains that I cannot really use Julia for this task

rongcuid · April 18, 2024, 7:04pm

Huh, is thread-splitting some Discourse magic I am not aware of?

ufechner7 · April 18, 2024, 7:07pm

Well, I am a happy user of Arrow.jl… Much faster to store and read files in .arrow format than doing that with .csv files…

I admit, I don’t have files larger than my RAM…

But I think reading files larger than your RAM should work…

One immediate, though perhaps not incredibly useful example, is Arrow.Stream.
Arrow.Stream implements Tables.partitions in that it iterates “tables” (specifically Arrow.Table), and as such, Arrow.write will iterate an Arrow.Stream, and write out each Arrow.Table as a separate record batch.
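
As a rough illustration of the partition behaviour described above, here is a minimal sketch. The data, the file name, and the `count_rows` helper are made up for the example, and it assumes Arrow.jl and Tables.jl are installed:

```julia
using Arrow, Tables

# Hypothetical data: three small column tables standing in for chunks of a larger dataset.
chunks = [(id = collect((3i - 2):(3i)), x = rand(3)) for i in 1:3]

path = joinpath(mktempdir(), "example.arrow")

# Tables.partitioner marks each element of `chunks` as its own partition,
# so Arrow.write emits one record batch per element.
Arrow.write(path, Tables.partitioner(chunks))

# Arrow.Stream iterates the file one record batch at a time,
# yielding an Arrow.Table for each batch.
function count_rows(path)
    batches, rows = 0, 0
    for batch in Arrow.Stream(path)
        batches += 1
        rows += length(Tables.getcolumn(batch, :id))
    end
    return batches, rows
end

batches, rows = count_rows(path)
```

This only shows the write/iterate round trip on a tiny file. It says nothing about how much memory Arrow.jl uses on large or compressed files, which is the point under dispute in this thread.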

If it doesn’t work for you, please provide an MWE that reproduces the problem.

rongcuid · April 18, 2024, 7:18pm

I don’t remember exactly why I dropped Arrow.jl. Either streaming straight up didn’t work, or working with Tables.partitions was too tricky.

I was dealing with some IPC files that were nearly 30GB, with zstd compression.

ufechner7 · April 18, 2024, 7:21pm

Well, if you say something doesn’t work or is too tricky without asking for a solution here on Discourse, well, this is really your problem, and not a problem of Julia or the Julia ecosystem…

rongcuid · April 18, 2024, 7:25pm

Well, between “figuring out myself”, “asking questions”, “trying another format”, and “switch to Python and have it working right now”, I simply chose the last.

rdavis120 · April 18, 2024, 7:43pm

There have been only three commits to Arrow.jl in the six months since the repo moved under Apache.

Apache Arrow asked for examples in each language of how to serve Arrow over HTTP, and almost every language except Julia has completed one.

I have never seen Arrow data passed from Julia to RCall or Python via zero copy.

Other languages support more than just reading and writing IPC files.

ufechner7 · April 18, 2024, 8:20pm

This indicates that you do not need to write fast code yourself, and that you are happy with what Python libraries provide… No problem, but if I wrote my code in Python it would be 10 to 100 times slower than the code I write in Julia, and THAT would be a problem for me…

jar1 · April 18, 2024, 8:29pm

OP said they’re also using Rust, which is famously fast.

I think our best option here is to thank OP for their detailed experience report and try to learn what we can from it.

rongcuid · April 18, 2024, 8:30pm

Python is never used by itself. In this case, I used Polars, which is written in Rust and exposes a Python API. It is just as fast as if I used Julia.

In my original topic, I said that Julia is great when I need a fast user-defined function, and that showed when I had to write a UDF for Polars that went through pycall, which was very slow. However, if I use Julia by itself, IO is a big problem.

Running some slow python code is still faster than debugging something as fundamental as an IO package. Here’s something I think is very important for me:

I DON’T want to deal with IO problems. If I can’t even import data correctly just by reading documentation, it’s out. I lose trust.

ufechner7 · April 18, 2024, 8:38pm

Your problem. No point in discussing this in this forum.

rongcuid · April 18, 2024, 8:38pm

No one is forcing you to reply.

gdalle · April 18, 2024, 8:44pm

This forum is exactly the sort of place where we can discuss our Julia problems, and benefit from the help of others. Let’s keep it as welcoming and useful as we can.

ufechner7 · April 18, 2024, 8:46pm

Well, there is no point in discussing problems with people who are not looking for a solution…

gdalle · April 18, 2024, 8:47pm

If you think this discussion is pointless, you are free to ignore it. But please don’t discourage others from asking questions or providing answers.
