Challenges with Arrow and Parquet in a (reasonably substantial) Julia Project

For “big data” IO formats such as Parquet, and to some extent Arrow, I think the long-term solution is to have a solid, reliable, and straightforward way to wrap Rust libraries in Julia. While it’s great to have whatever effort the community puts into native implementations, the reality is that most Julia developers (including myself) are here for other reasons and are not particularly excited about maintaining enterprise IO formats. I am happy to continue maintaining Parquet2.jl and will fix what issues I can, but Parquet is a huge format with a bewildering set of features, and much of the software that writes Parquet files consists of JVM packages whose maintainers probably aren’t particularly interested in interop with anything outside the JVM. The Arrow standard makes it possible to expose low-level buffer views of data held by wrapped libraries, so there is every reason to wrap Polars and take advantage of all the work happening there. Many Julia developers have at least some interest in Rust, and some great work has been done with jlrs, but it would be nice to have more people continuing the effort. There is also Polars.jl, which seems functional but, as far as I know, does not do low-level wrapping of Arrow buffers, so its applicability may be limited.

At the same time, I would encourage new users to be open to calling dependencies from other languages if needed. This is how relatively new and niche languages are able to establish themselves in the first place. There may be few to no benefits to using Julia if all you’re doing is taking the output of one black-box wrapped function and plugging it into another, and there are sometimes real obstacles to using wrapped packages (such as the difficulties with pyarrow that motivated me to write Parquet2.jl). Still, there are a huge number of cases where you can use a dependency for one specific thing that lacks native support, and it isn’t a big deal.
