Challenges with Arrow and Parquet in a (reasonably substantial) Julia Project

liuyxpp · April 19, 2024, 1:59am

I have taken a look at Serde.jl and found it to be COOL. Skimming the docs, it looks much easier to use than JSON3.jl and YAML.jl. Why hasn’t it been announced here on Discourse so that more people will see it?

mkitti · April 19, 2024, 2:28am

My one lament here is that I’m hearing about these problems now, after your experience, and not earlier when you were having these troubles. I might have been able to help. You may have very well reached out then, and I missed it.

I am a member of the JuliaIO organization particularly because I have a specific interest in HDF5.jl, which is central to one of my projects. While I mostly work on it to advance my own interests, I eventually gained enough expertise in the package and the underlying C library to help others. For a while my permissions were scoped only to HDF5.jl, but I now have more permissions. That said, I usually do not work on other repositories in the organization unless prompted. In summary, I joined JuliaIO because I had an IO problem in Julia and then chose to fix it.

The JuliaIO organization, the other GitHub organizations, and even the Julia organization itself really just consist of groups of people. There’s actually nothing particularly special about this group other than a common interest in building upon Julia. Some of them may be paid to work on Julia, but most people are mainly there to work on particular packages they have an interest in. Even if they are paid to work on Julia, they are probably paid to work on a different part of it than the one you want, unless you are the one paying them. If anything, a package being under one of these organizations means that the original maintainer may have moved on to other things and had the forethought to open maintenance to the community rather than abandon the project outright.

I remember when I was a new graduate student. I liked the idea of open source. I used R for my first project, but eventually I needed to work on image processing, which led me to focus on MATLAB for about a decade because it’s what other people around me used. Looking at Python then, it was just not mature enough, and R was not particularly great at image processing. Yes, I get it that Julia is not for everyone right now.

At the moment, I’m fortunate to have the capacity to choose my tools and work on open source, as well as some capability to resolve the issues. Actually, one reason I like participating in Julia development is that I can participate and that I can fix the problems I encounter.

While Python does work for most things, it’s when it does not work that the pain really seems to start for me. While I do contribute to some Python and Java packages occasionally, I have always found it to be a fairly frustrating open source experience relative to Julia. The Python effort is usually spread over some combination of Python, Cython, C and Fortran and a variety of non-interoperating tools. Moreover, the approach to performance is often self-defeating. The Java effort is a quagmire when it comes to interacting with native code. All around, though, open source is greatly understaffed. I do wonder about the sustainability of other projects because I find them harder to contribute to, but I also acknowledge they have had more time and resources to get where they are.

My guess here is that Arrow.jl and Parquet.jl worked well enough for the people that worked on them. I’m not sure why they did not work quite as well for you. Perhaps you use them in a different way than they do. If you have specific issues you can report, I hope you find some time to write them up.

Overall, I find the expectations of open source users to be quite mismatched with the reality of open source. By and large open source contributors are volunteers and mostly normal users themselves. This is not a particular critique of the original poster, but I do find the expectation that open source packages will just work for everything far from reality.

I thank you for the time you took to write down your experiences, and I understand that things did not meet your expectations. I’m fortunate to be in a place that when I encountered a similar situation I had the time and capability to get more involved. While I know that not everyone can respond likewise, I encourage you to find ways to support the open source projects you do end up using. Frankly, the only way open source will work is if everyone contributes.

rongcuid · April 19, 2024, 3:01am

Thank you for being a part of the FOSS community. When I did my project, I encountered about 5 IO libraries in a row, spanning multiple different formats, that simply didn’t work in Julia. I got so sick of it because every time I tried another format, I needed to hand roll another conversion tool, which despite being written in Rust, still needed to run for >40 minutes. In my case, the problem was that most of these libraries were adequate at “serializing”: that is, working with small files that reside in memory and are read/written as a whole. This doesn’t work for the number of traces I analyzed. The moment I touched something big, the IO libraries simply broke.

I feel like R’s ecosystem was quite similar to TeX… heck, it’s called CRAN. From my experience in high school, I still remember having to use a dictionary to translate the English documentation for R packages, while figuring out its syntax with no guidance, using the HTML and PDF documentation completely offline, with an Atom netbook hidden on my lap, while attending some boring lecture I was not listening to. It actually worked, and I completed multiple assignments with it (despite our statistics course being taught in Excel). R’s documentation is legendary. Can you imagine anyone achieving this with Julia? The only other FOSS language with this level of documentation is, you might have guessed it, TeX, which I suppose is only natural. Or maybe Gnuplot, which I also brute forced through its docs.

I really think that most of Python’s pain points come only after one has already got started. For instance, I would generally have no problem importing data and doing simple processing. Then after a while, all kinds of performance and maintenance issues creep up… but I already have a code base, a “reference implementation” if you will. I can pile more sh!t on top of it because of sunk cost, I can refactor it, or I can even rewrite it in another language, given that I have a better understanding of the problem.

I have done this process multiple times. Just quickly write a Python program which parses 1% of the data I need. If it’s not horribly slow and if I don’t crash my computer, I keep it and proceed with the rest of the data. Otherwise I write an analyzer in Rust.

Read it as this: by the time one encounters problems with Python, they are already too deep to return; Julia stops them on their first step.

That is why in the original post, even though it’s not the main focus, I say IO is an especially important part of the ecosystem. A language is of no use if it can’t ingest data.

I see that Julia is focusing on important problems. I experienced the TTFX improvements going from Julia 1.0 to 1.10, so I know it’s addressing common roadblocks and turn-offs. I think IO and documentation are pretty solid next goals. I like the language, but I think it’s immature.

merlin · April 19, 2024, 3:48am

I am also very thankful for Parquet2; ExpandingMan has been going all out on this for a while. The documentation has taught me a lot about the Parquet format and I am soooo glad to not have to install a JVM to work with Parquet.

As an alternative, DuckDB has a Julia API and reads Parquet with ease. That project is actively developed by a team and it’s quite a popular DB.

rongcuid · April 19, 2024, 3:49am

It was stuck on 0.9.1 for 11 months. Months that coincided with my project, on a specific version that crashed only in the Julia binding. Very unfortunate indeed.

asbisen · April 19, 2024, 4:53am

What do you mean, stuck at 0.9.x? Version 0.9.x was released only 6 months ago, and I have been using 0.10 for a while now through Julia.

rongcuid · April 19, 2024, 11:28am

0.9.1 segfaults on Julia. Actually it happens in the wrapper. I asked for a version bump for months, but a 0.9.2 never happened, so I switched to Python. 0.10 came long after that.

rdavis120 · April 19, 2024, 4:09pm

It was an issue in the Julia client API making an invalid call to the C API, but it supports your point that the Julia client is not as well supported as the clients for other languages. It is important to recognize that they are explicit that Julia is not an officially supported client, but they have accepted pull requests for the Julia client very quickly. The C API also supports streaming query results, but the Julia API doesn’t implement the Tables.partitions interface for this yet.

simsurace · April 19, 2024, 10:36pm

What’s the deal with the examples of serving Arrow data over HTTP? That should be fairly easy to do. Has no one attempted this?

rdavis120 · April 20, 2024, 10:55am

I was able to serve Arrow by writing the table to an IO buffer; I’m not sure if it could be done without that extra copy. It would be great if you could stream it without the copy.

You can consume Arrow by passing the response body into the Arrow.Table(bytes::Vector{UInt8}) function.
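A minimal sketch of the round trip described above, leaving out the HTTP layer itself. The table `tbl` and its column names are made up for illustration; `Arrow.write` to an `IOBuffer` and `Arrow.Table` on a byte vector are the two calls mentioned in this post.

```julia
using Arrow

# Hypothetical example table; any Tables.jl-compatible source should work.
tbl = (id = [1, 2, 3], name = ["a", "b", "c"])

# Server side: serialize into an in-memory buffer (the extra copy mentioned above).
io = IOBuffer()
Arrow.write(io, tbl)
bytes = take!(io)              # Vector{UInt8}, usable as an HTTP response body

# Client side: hand the raw response bytes straight to Arrow.Table.
received = Arrow.Table(bytes)
```

The columns of `received` can then be accessed as properties, e.g. `received.id`.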

I was not able to serve parquet for my dataset because Parquet.write didn’t support DateTime types.

GeorgeGkountouras · April 20, 2024, 1:28pm

According to the docs Arrow.Stream("a_file") uses mmap (only the first time? on every read?). Is that correct?
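Leaving the mmap question aside, here is a small sketch of what `Arrow.Stream` does at the API level: it iterates over the record batches of an Arrow source one at a time, each as its own table. The data and the use of `Tables.partitioner` to produce two record batches are assumptions chosen for illustration, and an in-memory byte vector stands in for the file.

```julia
using Arrow, Tables

# Two hypothetical partitions; each one is written as its own record batch.
parts = Tables.partitioner([(x = [1, 2],), (x = [3, 4, 5],)])

io = IOBuffer()
Arrow.write(io, parts)
bytes = take!(io)

# Iterate batch by batch instead of materializing everything as one table.
batch_sizes = Int[]
for batch in Arrow.Stream(bytes)
    push!(batch_sizes, length(batch.x))
end
```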

Can CxxWrap.jl be used to call the Arrow reference implementation?

simsurace · April 20, 2024, 1:45pm

Writing to an IOBuffer seems like the way to go, I think that’s how I did it in the past. The additional copy you mention seems hard to avoid if the table is not already a memory-mapped or in-memory arrow table? Or maybe I’m misunderstanding you. Usually the tables you serve via http will be some Tables.jl table that was generated in Julia that first needs to be serialized to a byte vector by Arrow.jl. That seems to be exactly what a reusable IOBuffer that is emptied when sending the table over the network is designed for.
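The reusable-buffer idea can be sketched like this. The helper name `arrow_bytes!` and the sample tables are hypothetical; the point is only that `take!` hands back the serialized bytes and resets the buffer, so a single `IOBuffer` can serve many responses.

```julia
using Arrow

# Hypothetical helper: serialize a Tables.jl-compatible table into `io`
# and return the bytes. `take!` resets the buffer, so `io` can be reused.
function arrow_bytes!(io::IOBuffer, tbl)
    Arrow.write(io, tbl)
    return take!(io)
end

io = IOBuffer()                                   # allocated once
first_body  = arrow_bytes!(io, (x = [1, 2, 3],))   # body for one response
second_body = arrow_bytes!(io, (y = ["a", "b"],))  # same buffer, next response
```

Each returned byte vector would then be used as the body of one HTTP response; the serialization copy itself remains.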

I think it would be nice to submit a complete example to http/get_simple in the apache/arrow-experiments repository on GitHub. I can try to write one up unless somebody wants to go first.

rongcuid · April 20, 2024, 3:54pm

I don’t remember, please don’t ask for more details. It was last year, and I decided to switch for good reasons.

simsurace · April 20, 2024, 6:25pm

I started writing a Julia example here. The server does not yet work for the full-size example (100 million rows) because it triggers some libuv issue. I need to figure out why. I also only tested against the Python client and server.

Feel free to post suggestions/comments in the PR.

TheLateKronos · April 24, 2024, 11:24am

I want to plug the JuliaPackageComparisons page on FileIO. The current content is more or less something I threw together in a couple of afternoons, but I still think it is a rather good resource for an overview of the file IO landscape in Julia. PRs are very welcome: I have only used a small fraction of the packages, but I have written all the content.

simsurace · May 1, 2024, 10:05am

The Julia implementation I started in Pull Request #29 of apache/arrow-experiments (“Add Julia example [WIP]”) seems to be functional, but seems to be less performant than other implementations. Let’s discuss optimization in another thread.

Palli · May 6, 2024, 11:43am

I suppose you’re right about Python, but it doesn’t need to be this way in Julia, and I think all your problems are avoidable. I.e. if this is about I/O, Arrow, or any other file format supported by e.g. Python, or in fact any other Python library (or e.g. C), you can just call it with PythonCall.jl (but I discuss below doing without Python, with or without Rust):

Do you mean it has tutorials for every module in Python’s stdlib? As part of their docs that is plausible; I don’t know if it is true. What matters is that they exist somewhere, and if the claim were about any module, not just those in the stdlib, then it is for sure false. Julia also runs external programs easily(?), better than Python or the shell, i.e. it has improved on them.

Julia already supports a very long tail of file formats (and at least the most common databases, including DuckDB and Oracle). I thought the long list was actually quite impressive, as is the Arrow.jl support (though I’ve never used it myself) and e.g. the state-of-the-art speed of CSV.jl. I was hoping the reasons NOT to use Julia were going away…

Right, that said I found:

Application: passing data between R and Python

The R and Python Arrow libraries are both based on the Arrow C++ library, however their respective toolchains (mandated by the R and Python packaging standards) are ABI-incompatible. It is therefore impossible to pass data directly at the C++ level between the R and Python bindings.

Using the C Data Interface, we have circumvented this restriction and provide a zero-copy data sharing API between R and Python. It is based on the R reticulate library.

I suppose it could, but it seems it should not, i.e. it would have the same problem as above. [Note, elsewhere it has been suggested that Julia could be the neutral language for making packages that also serve Python, R, Stata, and other statistics software. That suggestion wasn’t about I/O, nor should it be restricted to statistics. I think it’s a sort of valid point, but that neutral language could also be e.g. Rust, at least for I/O.]

In some cases you only want Julia to work. Or a) with Python, or b) with R or c) both…

I would say the first priority is getting Julia to work alone (but to me that doesn’t rule out using e.g. a C or Rust dependency, like Polars). But ideally it shouldn’t rule out a), the highest priority IMHO; nor b) or c). Others might see working with R as a higher priority; that depends.

So then it seems like wasted effort to use the C++ API (also bad to do “pure Julia”?).

I don’t know, but at least consider whether Parquet2.jl is the future, or rather Polars.jl, and even whether these data formats are the future. Meta/Facebook is making a replacement for the Parquet data format, if I recall. It is not yet open source, but they claim it will be.

Which is also awesome, and potentially the very important file format, though I want those other standard formats supportable, somehow, in a great way.

Thank you for making it. Even if Polars[.jl] is [not] actually better in every way, please link to it from the docs as an alternative. Also Parquet.jl should also link to (your package and/or) Polars.jl.

I see that Polars, which it is based on, has e.g.:

Hybrid Streaming (larger-than-RAM datasets)

In short, famously all languages have difficulty wrapping C++ (or Rust) libraries, unless the library exposes a C API, which both languages support doing (and then you’re calling C, not really C++, directly).

I’ve seen that Polars is well supported from Python, I think since it uses its C API, and I see no reason that a Julia wrapper can’t be as good as any wrapper for a Rust library such as Polars from any other language. Yes, in theory it could be outdated, or not exist yet; then you can use Python’s wrapper.

It IS easy to call C++ from Python once some work has been done to support such C++ code with the relevant Python libraries, and I don’t know of them breaking later. That is in contrast to CxxWrap.jl, which has tended to break (some brand-new solution also exists, plus Cxx.jl, which had a similar problem), because it relies on Julia’s unstable API (I think Julia’s C API; Julia has a syntax guarantee, NOT a stable C API guarantee, and only has a stability guarantee for most or all(?) of its other [non-C] API). Julia has repeatedly broken it, and thus CxxWrap, meaning CxxWrap has to release a new version to fix the situation. I hope that is coming to an end, i.e. that there is no longer a need to change Julia’s [C] API and/or that they make it officially stable. Note, calling from Julia to C has never been a problem, so I’m not up to speed on why calling C++ is problematic for CxxWrap, i.e. why it uses the C API, which is meant for embedding Julia (calling Julia from another language), not for calling out from Julia.

As I already explained, you may not want to call the C++ API anyway, at least if you care about R, for Arrow and otherwise. Note, most R packages, or as I understand it at least all the high-performance ones, are implemented in C++ (when not in R alone), so R can certainly call C++, just not in a way useful for Python to reuse (directly)… or for Julia.

[Note, on the opposite problem: you can call Julia from C++ with jluna (see its documentation), and I’ve never heard of it breaking even though it relies on the C API of Julia (maybe because it’s more recent? Or because it avoids the problematic API that CxxWrap uses? Meaning some of the C API is stable?). It makes calling Julia easier since you don’t have to use the C API directly.]

Yes, relying more on C (and Fortran) than on C++ (though sometimes on that too), as opposed to R. It would be great if they standardized on a common language, Julia or another, e.g. Rust.

We want Julia packages in general to be “pure Julia” (to be generic for any type), but file I/O is an interesting exception. Fast parsing is very hard, and I’M NOT saying Julia can’t be as fast as e.g. Rust or C++, just that we want to reuse code, and that code is sometimes in assembly, e.g. for UTF-8 parsing (or rather validation, which Julia doesn’t do by default). More importantly, a file type supports only a limited set of number formats, so the I/O code doesn’t need to be generic (at least for reading; for writing (the O in I/O), we might want to convert arbitrary types, e.g. DecFP, or not, and could support that in the wrapper, or let people handle the conversion).

Was that with compression? I don’t know how it is implemented; mmap is sometimes used, and there was talk of support missing in Julia (this might be outdated info, and it seems it might only support uncompressed data).

It’s great to know of. I suppose you sort of announced it… at least now I know. I think the main problem is with popular file formats and the packages named after them: since they take the good/best name, they should be responsible for pointing to (possibly) better alternatives. E.g. Parquet2.jl wouldn’t have a discovery problem that way, nor JSON3.jl, and I actually suggested a doc PR in 2021 (not yet merged, now updated to include Serde.jl):

Marcelo_Simas · May 6, 2024, 2:53pm

No, compression slows down the process and requires more memory. I was going for speed and not size-at-rest efficiency. I tested with compressed files (and Parquet files), and it worked; it was just slower, and speed was very important in this application.
