Challenges with Arrow and Parquet in a (reasonably substantial) Julia Project

To be clear, I am not asking for a solution here, so, well, I am not looking for one. Technically this subtopic might fit better in the Community category, but anyway.

4 Likes

JuliaIO needs a lot of work. In an ideal world, the packages would only rely on stable APIs and not need much maintenance. However, in my experience, many of the packages in JuliaIO rely on Julia internals to get around garbage collection issues or other performance issues. Many of the packages also suffer from the two-language problem because they wrap a library in another language, and those libraries are also often undermaintained, have bugs of their own, and tend to be overly complicated.

Right now I do all my IO with SmallZarrGroups.jl, ZipArchives.jl, and JSON3.jl, and I try to help keep those packages and their dependencies working.
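
For anyone curious what that kind of plain-Julia IO looks like in practice, here is a minimal sketch using just JSON3.jl; the file name and payload are made up:

```julia
using JSON3

# hypothetical metadata payload; JSON3 round-trips NamedTuples/Dicts directly
meta = (run = "example-1", samples = 128, tags = ["test", "io"])

open("meta.json", "w") do io
    JSON3.write(io, meta)          # serialize to a JSON file
end

back = JSON3.read(read("meta.json", String))
back.samples                       # 128, accessed as a field of a JSON3.Object
```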

5 Likes

Your problem. No point in discussing this in this forum.

well, it’s my problem too. I did the exact same thing as OP and just used Polars to circumvent IO issues

not just with Parquet or Arrow. I have also had problems getting CSV.jl to read (pretty innocent) CSV files, and with the AWS integrations for IO with S3

in general the “data engineering” corner of the world is prohibitively immature in Julia compared to Python. Again, not blaming maintainers, and I appreciate the time they spend on these packages, but it just is what it is

8 Likes

For “curated” packages like those under JuliaIO, there should be no hacks. Just rock solid with mediocre performance. If I want something fast, I can always use some shiny third party packages.

3 Likes

I think there is a bit of a miscommunication about what JuliaIO and similar organizations provide. They are just a convenience mechanism to share work between several people and de-risk in case a maintainer has to leave. These organizations don’t offer any particular guarantee of curation or quality.

10 Likes

I don’t know anything about the Parquet format, but Parquet.jl depends on ThriftJuliaCompiler_jll.jl, which is a 3-year-old fork of a 30K-line C++ project: GitHub - tanmaykm/thrift: Mirror of Apache Thrift

Is this heavy dependency needed? If I want to improve Parquet.jl, how can I know whether any of my changes will have some bad interaction with one of those 30K lines of C++?

2 Likes

Butbutbut… it has Julia in its title!!!

More seriously speaking, it’s kind of an expectation problem. If Haskell couldn’t read Parquet or Arrow, or only had zombie libraries for them, I wouldn’t have any problem with it. I don’t expect Haskell to do that.

Julia is a number-crunching language. It is designed from the ground up to do numerical computation. If I can’t read in some data within 5 minutes of picking up the language, that’s a serious problem.

And do I want to contribute to the ecosystem? Yes, I contribute to packages, and sometimes even provide patches. But that only happens after I have had a good experience first, because I am doing it in my own free time.

There’s no free time when I’m trying to meet a deadline.

10 Likes

First, I highly appreciate the love the authors have freely put into Arrow.jl, Parquet.jl, and Parquet2.jl.

I want to echo the observation that things lag behind and are not nearly as feature-complete in the Julia world when it comes to the Apache I/O formats. For the record, I only know a few things about Arrow.jl, from working on it and on a similar format HEP uses.

Over the years, I have filed a few correctness problems and show-stopping bugs related to performance and large, compressed files. I really, really want our Arrow story to be good; hell, I have reverse engineered something in HEP that’s as complicated as Arrow.jl, if not more so. So if there’s interest and positive feedback from maintainers, I’d love to throw my free time into making Arrow.jl better.

I was very early to explore zero-copy sharing with other Arrow implementations: Re-use PyArrow memory via PyCall · Issue #92 · apache/arrow-julia · GitHub. And we still want to see that happen, especially now with efforts like AwkwardArray.jl.
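
For context, even without the zero-copy path in that issue, you can already move tables between PyArrow and Arrow.jl by round-tripping IPC bytes (which does copy). A rough sketch, assuming pyarrow is installed in the Python environment PyCall uses:

```julia
using PyCall, Arrow

pa = pyimport("pyarrow")

# build a tiny table on the PyArrow side
pytbl = pa.table(Dict("x" => [1, 2, 3], "y" => [0.1, 0.2, 0.3]))

# serialize it to an in-memory Arrow IPC stream (this copies; true zero-copy
# buffer sharing is what the linked issue asks for)
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, pytbl.schema)
writer.write_table(pytbl)
writer.close()
bytes = Vector{UInt8}(sink.getvalue().to_pybytes())

tbl = Arrow.Table(bytes)               # read the same data back with Arrow.jl
tbl.x                                  # [1, 2, 3]
```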

Anyway, all of these went nowhere, so at the moment it’s hard for me to say more; I feel powerless to help the situation.

6 Likes

For “big data” IO formats such as Parquet, and to some extent Arrow, I think the long-term solution is to have a solid, reliable, and straightforward way to wrap Rust libraries in Julia. While it’s great to have whatever effort the community puts into them, the reality is that most Julia developers (including myself) are here for other reasons and not particularly excited about maintaining enterprise IO formats. I am happy to continue maintaining Parquet2.jl, and will fix what issues I can, but it is also a huge format with a bewildering set of features, and much of what writes Parquet files are JVM packages that probably aren’t particularly interested in interop with anything outside the JVM.

The Arrow standard makes it possible to expose low-level buffer views of data in wrapped libraries, so there is every reason to wrap polars and take advantage of all the work that is happening there to deal with this stuff. Many Julia developers have at least some interest in Rust, and some great work has been done with jlrs, but it would be nice to have more people continuing the effort. There has also been Polars.jl, which seems functional, but as far as I know it does not do low-level wrapping of Arrow buffers, so its applicability may be limited.

At the same time, I would encourage new users to be open to calling dependencies from other languages if needed. This is how relatively new and niche languages are able to establish themselves in the first place. While there may be few to no benefits to using Julia if all you’re doing is taking the output of one black box wrapped function and plugging it into another, and there are sometimes real obstacles to using wrapped packages (such as the difficulties with using pyarrow that motivated me to write Parquet2.jl), there are also a huge number of cases where you can use a dependency for one specific thing that you may not have native support for and it isn’t a big deal.
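
As a concrete (if unglamorous) example of that approach, reading a Parquet file through a wrapped pyarrow and pulling one column into Julia can be just a few lines; the file and column names below are hypothetical, and PyCall would work just as well as PythonCall:

```julia
using PythonCall   # assumes pyarrow is available in the configured Python environment

pq = pyimport("pyarrow.parquet")
tbl = pq.read_table("data.parquet")                 # hypothetical file

# copy one column over to a plain Julia vector (not zero-copy, but often good enough)
x = pyconvert(Vector{Float64}, tbl.column("x").to_pylist())
```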

18 Likes

Right, I mentioned in the original topic, but I think I should restate here:

Julia’s community is awesome. I appreciate all the work put into all these community packages so I can simply type ]add to access all of them.

I do not mean to talk bad things about these packages, as they no doubt required lots of effort from authors and contributors. However, practically speaking, I cannot base my projects on an unstable foundation.

I’ve had the pleasure of debugging a full-stack problem where a CSS style-sheet entry went through a whole stack of software and caused my desktop environment to crash whenever I typed in one specific window using a specific Chinese input method. I do not want to repeat this on my primary research project.

I think it’s better to have a solid, low-level API than to attempt to build something high-level and abstracted. It’s far less work, it lets users circumvent more bugs by hacking around them, and it serves as a good intermediate API.

Something like a thin wrapper over the _jll libraries, just barely making the API Julia-like, would be a good start.
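
To illustrate what “thin” can mean here: when the underlying library has a C ABI, the whole wrapper can be a handful of ccalls into the _jll artifact. A minimal sketch using Zstd_jll, just querying the library version:

```julia
using Zstd_jll   # JLL package shipping the libzstd binary

# a thin, barely Julia-like wrapper: one ccall per C function, nothing more
zstd_version() = ccall((:ZSTD_versionNumber, libzstd), Cuint, ())

zstd_version()   # returns major*10000 + minor*100 + patch, e.g. 10506 for v1.5.6
```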

2 Likes

The practical problem with that is that the Arrow reference implementation is in C++, not in C, so that’s immediately non-trivial. We have _jll wrappers like that for other applications, especially when the target/reference is in C. Generally I don’t think people are avoiding _jll wrappers out of “I want pure Julia”, but when it’s C++ it’s a different ball game.

4 Likes

Ok, I know C++ FFI is difficult.

On a side note, I think Arrow’s documentation is pretty bad as well.

4 Likes

I think that would likely come from a group that was entrepreneurial enough to create a commercial ETL framework using Julia for specialized or UDF code, and reusing the Rust/C++ libraries for operations which are common across languages.

Just wanted to say that I’ve been able to stream a large Arrow file (54 GB, 4.5 million rows, 2,200 columns) using Arrow.jl + TableOperations.jl to generate complex statistical aggregations with DataFrames.jl while keeping memory usage around 1 GB. The resulting solution ended up working faster than some C++ code which was using a proprietary binary format, and now a much wider audience can make improvements to that process.
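
For anyone wanting to try the same pattern, a stripped-down sketch of batch-wise streaming looks roughly like this; the file name, column names, and the aggregation are all placeholders:

```julia
using Arrow, DataFrames

totals = Dict{Symbol,Float64}()

# Arrow.Stream iterates record batches, so only one batch's columns need to be
# materialized at a time instead of the whole multi-GB file
for batch in Arrow.Stream("big_table.arrow")
    df = DataFrame(batch)                       # each batch is a Tables.jl table
    for col in (:x, :y)                         # placeholder column names
        totals[col] = get(totals, col, 0.0) + sum(skipmissing(df[!, col]))
    end
end
```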

Got a lot of valuable information on how to do that on this thread: How well Apache Arrow’s zero copy methodology is supported? - Specific Domains / Data - Julia Programming Language (julialang.org).

17 Likes

I tried TableOperations when I was doing the project in Julia. Its documentation on streaming tables was incomprehensible. In the end, Polars’ query language was just much more powerful.

That Polars (the Python package) itself had a chance of producing wrong results in Jupyter is a story for another time…

1 Like

I like OnlineStats.jl quite a lot. A very smooth integration of streaming Arrow with that package would be fantastic.
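
In the absence of an official integration, a do-it-yourself version is already fairly short; a sketch with made-up file and column names:

```julia
using Arrow, OnlineStats, Tables

stat = Series(Mean(), Variance())              # running mean and variance

# feed each Arrow record batch into the accumulators without loading the whole file
for batch in Arrow.Stream("big_table.arrow")   # hypothetical file
    fit!(stat, skipmissing(Tables.getcolumn(batch, :x)))   # hypothetical column
end

value(stat)                                    # (mean, variance) seen so far
```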

3 Likes

A sidebar question: is C++ FFI difficult only from Julia, or also from Python?

I got better guidance from forum posts than from the actual docs. I feel like they could have more examples and guidance. I am hopeful things are moving in the right direction, and I will start getting more engaged and try to help as much as possible. I’ve been an R user for close to 15 years and have had a complex love-hate relationship with Python over much of that time, and doing algorithm implementation work in Julia felt much more natural and empowering.

I do agree that investment needs to be made in supporting reading/writing Arrow and Parquet, as those are key to any serious data processing solution going forward.

3 Likes

C++ FFI is difficult everywhere, because C++ symbol names are mangled.

The lingua franca of FFI is C, because its ABI is pretty much the de facto standard for every cross-language calling convention.

All machines speak C, basically.
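
That is exactly why Julia’s ccall can reach into libc with nothing more than a symbol name, whereas a C++ method’s mangled, compiler-specific name would have to be spelled out exactly:

```julia
# strlen is a plain C export, so the symbol name is predictable and stable
n = ccall(:strlen, Csize_t, (Cstring,), "hello")   # 5
```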

4 Likes

I hate Python, but its IO is so good.

It’s hard to maintain, its module system is lacking, I dislike its syntax, its package management is abysmal…

However, it has a large stdlib, it runs external programs easily, and it has so many tutorials for every single module. Any beginner with a hint of CS knowledge can learn it by just reading the official docs. It’s so easy to interface with other tools. It handles many file formats in idiomatic Python…

At some point, so many beginners are attracted that quantity becomes a quality of its own.

9 Likes