Challenges with Arrow and Parquet in a (reasonably substantial) Julia Project

I was curious if you could expand on this a bit. In my experience Arrow.jl is well supported and performant.

I haven’t used it much, but there is also Parquet2.jl, which is actively developed.

5 Likes

Arrow.jl does not support the Arrow C data interface, so it’s not possible to pass an Arrow object between languages with zero copy. It also doesn’t support ADBC or Arrow Flight, which I think would be really beneficial, since it would allow users to host a Flight server with Julia as a backend.

For all IO, I have found DuckDB.jl, and DuckDB’s support across other languages, to be helpful, although its client API is not officially supported.

Instead of JSON + StructTypes, I have found Serde.jl to be helpful.

Arrow.jl doesn’t support streaming correctly. It also reads the entire Arrow file into memory, even when the file is uncompressed. I had some other issues, which I don’t remember, that forced me to switch first to Parquet, because at least Parquet was supported in Julia; then even Parquet.jl broke. Then I switched to DuckDB.jl, and alas, DuckDB 0.9.1 segfaulted only on Julia. By that point I had just switched to Python and Rust fully.

2 Likes

Are you talking about Parquet2.jl and Arrow.jl? What problems did you have?

Edit: This is discussed above already.

Arrow.jl and Parquet.jl.

Arrow.jl has:

  • No memory-mapped reads
  • No real streaming (doc says it has, but it actually reads the whole table in memory)

Parquet.jl has:

  • Can’t read my data, period.

I didn’t know about Parquet2.jl before I switched languages. I mean, Parquet.jl is supposed to be maintained by JuliaIO, so I never looked elsewhere.

2 Likes

well, Parquet2.jl cannot read my parquet file :slight_smile:

through no fault of @ExpandingMan — I appreciate his prompt attention when I have submitted issues — I guess it is due to some known issue in CodecLz4.jl

but the fact remains that I cannot really use Julia for this task

Huh, is thread-splitting some Discourse magic I am not aware of?

Well, I am a happy user of Arrow.jl… Much faster to store and read files in .arrow format than doing that with .csv files…

I admit, I don’t have files larger than my RAM…

But I think reading files larger than your RAM should work…

One immediate, though perhaps not incredibly useful example, is Arrow.Stream.
Arrow.Stream implements Tables.partitions in that it iterates “tables” (specifically Arrow.Table), and as such, Arrow.write will iterate an Arrow.Stream, and write out each Arrow.Table as a separate record batch.
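For what it’s worth, here is a minimal sketch of that round trip, assuming Arrow.jl and Tables.jl are installed. The column names, batch sizes, and temporary file are made up for illustration; only `Arrow.write`, `Arrow.Stream`, `Tables.partitioner`, and `Tables.getcolumn` come from the packages themselves. It only shows the batch-by-batch iteration; it does not demonstrate how much memory is used on large or compressed files, which is the point under dispute in this thread.

```julia
using Arrow, Tables

# Hypothetical data: three small partitions, each a NamedTuple of column vectors.
parts = map(1:3) do i
    ids = collect(1:3) .+ 3 * (i - 1)
    (id = ids, value = Float64.(ids))
end

# Illustrative temporary location; any writable path would do.
path = joinpath(mktempdir(), "partitioned.arrow")

# Tables.partitioner marks the collection as partitioned, so Arrow.write
# writes each element out as its own record batch.
Arrow.write(path, Tables.partitioner(parts))

# Arrow.Stream iterates the file one record batch at a time,
# yielding an Arrow.Table for each batch.
batch_rows = Int[]
for batch in Arrow.Stream(path)
    push!(batch_rows, length(Tables.getcolumn(batch, :id)))
end

println(batch_rows)
```

Each iteration sees only the rows of one record batch, so per-batch processing (filtering, aggregating, re-writing) can go inside the loop body.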

If it doesn’t work for you, please provide an MWE that reproduces the problem.

2 Likes

I don’t remember exactly why I dropped Arrow.jl. Either streaming straight up didn’t work, or working with Tables.partitions was too tricky.

I was dealing with some IPC files that were nearly 30GB, with zstd compression.

Well, if you say something doesn’t work or is too tricky without asking for a solution here on Discourse, well, this is really your problem, and not a problem of Julia or the Julia ecosystem…

1 Like

Well, between “figuring out myself”, “asking questions”, “trying another format”, and “switch to Python and have it working right now”, I simply chose the last.

11 Likes

There have been only three commits to Arrow.jl in the six months since the repo moved under Apache.

Apache Arrow asked for examples in each language of how to serve Arrow data over HTTP, and the examples for almost all languages except Julia have been completed.

I have never seen Arrow data passed from Julia to RCall or Python via zero copy.

Other languages support more than just reading and writing IPC files.

1 Like

This indicates that you do not have the need to write fast code yourself, that you are happy with what Python libraries provide… No problem, but if I wrote my code in Python it would be 10 to 100 times slower than the code I write in Julia, and THAT would be a problem for me…

OP said they’re also using Rust, which is famously fast.

I think our best option here is to thank OP for their detailed experience report and try to learn what we can from it.

26 Likes

Python is never used by itself. In this case, I used Polars, which is written in Rust and exposes a Python API. It is just as fast as if I had used Julia.

In my original topic, I said that Julia is great if I need fast user-defined functions, and it showed when I needed to write a UDF for Polars that went through pycall, which was very slow. However, if I use Julia by itself, IO is a big problem.

Running some slow python code is still faster than debugging something as fundamental as an IO package. Here’s something I think is very important for me:

I DON’T want to deal with IO problems. If I can’t even import data correctly just by reading documentation, it’s out. I lose trust.

11 Likes

Your problem. No point in discussing this in this forum.

No one is forcing you to reply.

4 Likes

This forum is exactly the sort of place where we can discuss our Julia problems, and benefit from the help of others. Let’s keep it as welcoming and useful as we can.

36 Likes

Well, there is no point in discussing problems with people who are not looking for a solution…

If you think this discussion is pointless, you are free to ignore it. But please don’t discourage others from asking questions or providing answers.

22 Likes