Arrow, Feather, and Parquet

Continuing the discussion from “JDF - an experimental DataFrame serialization format is ready for beta testing”:

@ExpandingMan Thank you very much! That’s a great summary and it helps me get an idea of the reliability level of the Feather format. It’s also nice to get a summary of the recent activities in the Arrow community.

Reading that the Parquet and Arrow C++ communities joined forces last year, I was expecting them to merge their on-disk file formats. Do you know why they are keeping two file formats? Maybe because with the Arrow format you can just mmap it? (And that’s not possible with Parquet?)
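
For illustration, what I have in mind by “just mmap it” is roughly the following (a toy sketch, not the actual Arrow/Feather reading code; the file name and length are made up):

```julia
using Mmap

# An uncompressed, fixed-width column is just a contiguous run of values
# on disk, so it can be mapped straight into memory and used as an
# ordinary Julia array, with the OS paging data in lazily as needed.
io = open("column_values.bin")                   # hypothetical raw values buffer
col = Mmap.mmap(io, Vector{Float64}, 1_000_000)  # length assumed known from metadata
s = sum(col)                                     # no explicit read or decode step
close(io)
```

Whereas a Parquet column is typically encoded and compressed, so it has to be decoded into a new buffer before it can be used like that.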

Fst is more powerful than arrow.

Thanks, I didn’t know about fst. Reading https://github.com/fstpackage/fst it looks like fst is an R-specific format designed for on-disk usage? I suppose it’s not a very fair comparison, since Arrow is a language-agnostic format designed primarily for in-memory usage.

No, it is written in C++. It is called fstlib.

I see. It’s good to know that there is another library that is targeting another problem domain.

Actually it looks like the domain is quite overlapping:

fstlib aims to provide a framework for working with on-disk data by using as little memory and disk space as possible while keeping processing performance at a high level. It does that by utilizing compression, random access and multi-threading as efficiently as possible. In effect, it’s Parquet and Arrow combined in a single package. That’s an important difference: with fst there is no distinction between an in-memory data structure and an on-disk data structure, they are one and the same.

Compare and contrast with parquet · Issue #129 · fstpackage/fst · GitHub

Aren’t they different things? fst is an on-disk format and Arrow is an in-memory format. So fst’s comparison point should be something like Parquet and Parquet writers and readers.

Also, the fst format isn’t published yet, and the author wants you to use his C++ code for inter-language usage. I was hoping to write a pure Julia implementation, but it seems too hard.

My sense is that a pure Julia Parquet writer would be the most useful thing at this point for long-term, large-scale data storage and interop. fst seems a bit niche to me.

Parquet is like the Python of disk formats: inoffensive and seemingly ubiquitous, so let’s use it. Technically, I have heard murmurs of dissatisfaction with its design and implementation.

I want to contribute to a Parquet reader/writer at some point. Both R and Python have native Parquet readers now, so it will be important for interop to have native Julia readers. Although PyCall with a good Arrow.jl implementation would also suffice, I’d say.
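
In the meantime, something along these lines already works as a stopgap (a rough sketch via PyCall; it assumes pyarrow is installed in the linked Python environment, and the file and column names are made up):

```julia
using PyCall

# Let pyarrow do the Parquet decoding, then pull the result across to Julia.
pq  = pyimport("pyarrow.parquet")
tbl = pq.read_table("data.parquet")              # hypothetical file

# Table.column returns a pyarrow ChunkedArray; to_pylist copies it into
# Python objects that PyCall then converts to ordinary Julia values.
col = tbl.column("some_column").to_pylist()
```

It’s a copy-heavy path, which is exactly why a native route would be nicer.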

It would be great if you could share what the biggest dissatisfactions are (or links to the discussions).

One source is @ExpandingMan’s discussions. He doesn’t seem 100% happy with them.

See this:
https://github.com/h2oai/datatable/issues/1109#issuecomment-397077120
I think I saw in the same repo that Parquet is no good at supporting transposes, etc.

See this post from the EuroSciPy Python dataframes summit.

The post is technically talking about Arrow (which I think is closely related to Parquet). Excerpt from the post:

  • Apache arrow C++ API and implementation not following common C++ idioms
  • Using a monorepo (including all bindings in the same repo as Arrow)
  • Not a clear distinction between the specification and implementation (as in for instance project Jupyter)

As I said, these are just “murmurs”: basically bits and pieces I’ve read from other places. I am sure that if you read more and more about Parquet you will also find pockets of examples where the technical details aren’t ideal. But I don’t know enough to comment.

But the bottom line is: Parquet is ubiquitous. My current employer is an 80% Python, 20% R shop; 0% Julia. And having no Julia Parquet reader makes it impossible to get Julia into the pipeline.

Thanks a lot! That’s a nice overview.

Well, just to be clear, we do have a native Julia Parquet reader (Parquet.jl), with integration into the whole table ecosystem. What we lack is a writer.
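
Reading currently looks roughly like this (a sketch only; the exact entry point has shifted between Parquet.jl versions, with older releases going through `ParFile` rather than `read_parquet`, and the file name here is made up):

```julia
using Parquet, DataFrames

# read_parquet returns a Tables.jl-compatible table, so any Tables.jl
# sink (here a DataFrame) can consume it directly.
tbl = read_parquet("data.parquet")   # hypothetical file
df  = DataFrame(tbl)
```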

OK. Last time I tested it on the Fannie Mae dataset saved as Parquet using Python, Parquet.jl couldn’t read it properly. So… it’s not yet very robust.

Would be great if you could open an issue; there’s no chance of it getting fixed otherwise :)

I found the situation quite confusing myself and believe I have pieced it together over time.

My understanding is that the original goal of arrow was to provide a common in-memory data format which would allow all the Apache “stuff” (much of which did not previously have common interfaces), pandas, and R to communicate tabular data with each other. As such, I believe arrow was deliberately designed as a rather minimal specification, so that a good deal of the data already sitting around in “sane” formats would already comply with arrow. For example, any memory-sequential ints or floats already comply with arrow. So a column of ints in feather, an uncompressed column of ints in parquet, a numpy vector of ints, and a Julia vector of ints all have a common format, since it’s pretty much the most obvious one. What arrow does is give you the language to supply metadata for things like that. Other things, such as null values, are less likely to be in the arrow format already, and in many cases new buffers must be generated in order to produce arrow messages (i.e. in such cases it is not zero-copy).
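
To make that concrete, here is a tiny illustration (deliberately nothing Arrow-specific is happening, which is exactly the point):

```julia
# A plain Julia Vector{Int64} is already laid out the way an Arrow Int64
# values buffer is: contiguous fixed-width values (little-endian on the
# usual hardware, which is what Arrow specifies).
v = Int64[1, 2, 3, 4]

# Viewing the same memory as raw bytes gives the buffer an arrow message
# would point at; only the metadata (type, length, null count) is missing.
bytes = reinterpret(UInt8, v)
length(bytes) == 8 * length(v)   # true
```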

The idea of the arrow feather and parquet readers is to read those file formats and emit messages in a common format (arrow), which is also simple enough that values can be accessed from it directly.

What I’d like to do in Julia is to convert the Feather.jl and Parquet.jl packages into “metadata translators” which convert their metadata to arrow and vice-versa, enabling read/write with arrow. I believe this is basically what they do in C++.

As for parquet, it’s a completely different format; as far as I know it did not originally have anything to do with arrow. Since arrow is fairly general (for tabular data at least) and simple, many parquet files already happen to have data in arrow formats (as all columnar data is very similar). Therefore much of a parquet file can often be considered to be “arrow buffers”, even though the arrow metadata must be generated from the parquet metadata.

Anyway, I’m definitely no authority on this and have no privileged insight into any of it; this is just what I’ve pieced together over the 2 years or so since I started working on overhauling Feather.jl.

It’s rather confusing for me too. I know that Arrow is in-memory and Parquet is on-disk. But what gives the impression that they are tightly entangled is that pyarrow provides the read_parquet functionality for pandas, and R’s {arrow} has the only reliable read_parquet function in native R (i.e. not via slow Spark).

They sit together in both Python and R, so they must be related somehow; that’s my logic.

As for Arrow itself, the conference post you shared was interesting, and I certainly share some of their apparent consternation about it. As for C++ idioms, I did not spend too much time looking at their code, and though I have a good deal of experience in C++, I was working in the high energy physics community and was somewhat in a bubble, so I’m not sure I’d recognize common C++ practices in the broader software community.

I also had some misgivings about the code being in a monorepo. I felt that, were we to have a Julia arrow package there, it would suddenly become much more difficult to work on because of the friction involved in working on such a large project. The authors seemed very insistent that a monorepo was a good idea, to the extent that they expressed interest in moving Julia code there even long before it was in a state appropriate for that repo. The main reason they gave for wanting a monorepo was their claim that it made testing easier, which, to be honest, I find rather dubious.

I suspect that one of the reasons they want the monorepo is that they did not do an incredibly thorough job of documenting the standard, instead relying on all the code being in that monorepo where the arrow authors can keep it compliant. I suspect this is why people expressed concerns about there not being a clear distinction between the specification and the implementation. A clear specification would require a much more detailed white paper, which they lack and which the authors have not shown any interest in, as far as I have seen.

As for the format itself, after my relatively recent work implementing it, I have several concerns about it:

  • The format is extremely general and supports quite a few composite data types. I therefore don’t really understand why it was decided to keep the format so focused on tabular data. The format itself in principle supports all sorts of data structures, but the metadata seems very strongly focused on tabular data. A good place to see where this has had dubious consequences is that the tensor metadata seems entirely tacked-on. I suspect if they had realized where this was all heading earlier, they would have changed the metadata and wound up with a more generalized IPC data format, which I think still would have been extremely useful, because, perhaps surprisingly, the alternatives are lacking.
  • The metadata seems to have unnecessary inconsistencies, which apparently forces significantly more code to be written and maintained. For example, strings are basically a structure that they have called List<UInt8>, but rather than implementing it this way, there are several places where the metadata is different for strings. This makes it much harder than it should be to write generic code that works for many different data types (see the sketch just after this list).
  • It is very hard to do much of anything with individual arrow messages or batches on their own, which is a little unfortunate, as it makes data harder to construct. For example, if a batch simply carried metadata saying that it is a particular binary data type (like int or float), the format would be far less dependent on the header, and rather than having to orchestrate the construction of an object across a bunch of separate batches, one could build it up as if by bricks. I realize they wanted to keep the metadata small because this is an IPC format, but I felt there were a few places where they could have made the metadata in the batches slightly bigger in exchange for a lot more convenience (along with smaller metadata in the header and possibly even less reliance on referencing the specification).
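
To illustrate the point about strings: logically a string column is just a list of byte lists, i.e. one offsets buffer plus one flat values buffer, which is why the special-cased string metadata feels unnecessary to me. A toy sketch of that layout (my own construction for illustration, not any package’s API):

```julia
# Arrow-style variable-length layout for the column ["ab", "c", "def"]:
# one offsets buffer delimiting elements and one flat values buffer.
valbuf  = Vector{UInt8}(codeunits("abcdef"))
offsets = Int32[0, 2, 3, 6]   # element i occupies valbuf[offsets[i]+1 : offsets[i+1]]

# Nothing here is string-specific; the same offsets+values scheme would
# describe any List<UInt8>, or indeed a List of anything.
getelem(i) = String(valbuf[offsets[i]+1 : offsets[i+1]])
getelem(3) == "def"           # true
```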

I suspect a lot of the things that seem odd about the format have something to do with pandas or Python, but I don’t really know. Also bear in mind that I know very little about what it takes to make a really good IPC format, so take what I say with a grain of salt. The arrow authors may have perfectly good reasons for much of what I’m complaining about here, in which case I’d be happy to retract my criticism.

Having said all this, there is already quite a lot of stuff for which arrow has been implemented, and it could be useful for Julia in all sorts of ways. For people to be willing to use Julia at all in the “big data” world, it’s going to have to do things like interact efficiently with Spark, which will require reading and writing arrow.

I have not given up on my overhaul of Arrow.jl, and I have made lots of significant progress (I can serialize and deserialize many messages), but I’ve repeatedly gotten derailed by work and personal stuff, so I can’t say exactly when I’ll return to it in earnest.

Like I said, you can (and they do) create a parquet reader that emits arrow messages. That’s what’s going on there; take a look at the parquet file format. It’s different from arrow; it even uses a different serialization format for the metadata (Thrift, rather than arrow’s FlatBuffers).