Arrow, Feather, and Parquet

tkf · September 13, 2019, 8:43pm

Continuing the discussion from JDF - an experimental DataFrame serialization format is ready for beta testing:

@ExpandingMan Thank you very much! That’s a great summary and it helps me getting idea of reliability level of feather format. It’s also nice to get a summary of the recent activities in Arrow community.

Reading that Parquet and Arrow C++ communities joining forces last year, I was expecting they were going to merge on-disk file format. Do you know why they are keeping two file formats? Maybe because with arrow format you can just mmap it? (And it’s not possible with parquet?)

Yifan_Liu · September 13, 2019, 9:57pm

Fst is more powerful than arrow.

tkf · September 13, 2019, 10:12pm

Thanks, I didn’t know about fst. Reading https://github.com/fstpackage/fst it looks like fst is R-specific format designed for on-disk usage? I suppose it’s not a very fair comparison since arrow is a format that is language-agnostic and designed primary for in-memory usage.

Yifan_Liu · September 13, 2019, 10:20pm

No, it is written in c++. It is called fstlib.

tkf · September 13, 2019, 10:27pm

I see. It’s good to know that there is another library that is targeting another problem domain.

tkf · September 13, 2019, 10:33pm

Actually it looks like the domain is quite overlapping:

fstlib aims to provide a framework for working with on-disk data by using as little memory and disk space as possible while keeping processing performance at a high level. It does that by utilizing compression, random access and multi-treading as efficiently as possible. In effect, it’s Parguet and Arrow combined in a single package. That’s an important difference: with fst there is no distinction between an in-memory data structure and an on-disk data structure, they are one and the same.

— Compare and contrast with parquet · Issue #129 · fstpackage/fst · GitHub

xiaodai · September 13, 2019, 11:11pm

Aren’t they different things? Fst is an on disk format and arrow is an in memory format. So fst’s comparison point should be something lile parquet & parquet writers and readers.

xiaodai · September 13, 2019, 11:13pm

Also the fst format isn’t published yet and the author wants you to use his C++ code inter language usage. I was hoping to write a pure Julia implementation but seems too hard.

davidanthoff · September 13, 2019, 11:17pm

My sense is that a pure julia parquet writer would be most useful at this point for long term, large data storage and interop. fst seems a bit niche to me.

xiaodai · September 14, 2019, 12:24am

Parquet is like the python of disk formats. Inoffensive and seems to be ubiquitous. So let’s use it. Technically, I have heard murmurs of dissatisfaction with it’s design and implementation.

I want contribute to a parquet reader/writer at some point. Both r and pythin native parquet readers now. So it’s will be important for interop to have native Julia readers. Although pycall with a good arrow.jl implementation would also suffice I’d say.

tkf · September 14, 2019, 1:05am

It would be great if you can share what are the biggest dissatisfactions (or links to the discussions).

xiaodai · September 14, 2019, 1:24am

One source is @ExpandingMan’s discussions. He seems not 100% happy with them.

See this
https://github.com/h2oai/datatable/issues/1109#issuecomment-397077120
I think I saw in the same repo that Paruqet are no good at supporting transpose etc.

See this post from EuroScipy Python dataframes summit.

The post technically is talking about Arrow (which I think is closely related to parquet). Excerpt from the post

Apache arrow C++ API and implementation not following common C++ idioms
Using a monorepo (including all bindings in the same repo as Arrow)
Not a clear distinction between the specification and implementation (as in for instance project Jupyter)

As I described them as “murmurs”, so it’s basically bits and pieces I’ve read from other places. I am sure if you read more and more about parquet you will also find pockets of examples where the technical details aren’t ideal. But I don’t know enough to comment.

But bottom line is: parquet is ubiquitous. My current employer is a Python 80% and R 20% shop; 0% Julia. And having no Julia parquet reader makes it impossible to have them on the pipeline.

tkf · September 14, 2019, 2:11am

Thanks a lot! These are nice overview.

davidanthoff · September 14, 2019, 2:30am

Well, just to be clear, we do have a native julia parquet reader here with integration into the whole table ecosystem here. What we lack is a writer.

xiaodai · September 14, 2019, 2:57am

OK. Last I tests it on the Fannie Mae dataset saved as parquet using python, Parquet.jl can’t read it properly. So… it’s not yet very robust.

davidanthoff · September 14, 2019, 3:01am

Would be great if you could open an issue, no chance of it getting fixed otherwise

ExpandingMan · September 16, 2019, 1:37pm

I found the situation quite confusing myself and believe I have pieced it together over time.

My understanding is that the original goal of arrow was to provide a common in-memory data foramt which would allow all the apache “stuff” (much of which, did not previously have common interfaces), pandas and R to communicate tabular data with each other. As such, I believe arrow was deliberately designed as a rather minimal specification so that a good deal of the data already sitting around in “sane” formats would already comply with arrow. For example, any memory sequential ints or floats already comply with arrow. So, a column of ints in feather, an uncompressed column of ints in parquet, a numpy vector of ints, a Julia vector of ints, all of those have a common format, since it’s pretty much the most obvious one. What arrow does is gives you the language to supply metadata for things like that. Other things, such as null values, are less likely to be in the arrow format already, and in many cases must be generated to generate arrow messages (i.e. in this case it is not zero-copy).

The idea of the arrow feather and parquet readers is to read those file formats and be able to emit messages in a common format (arrow), which is also simple enough to be able to access values from directly.

What I’d like to do in Julia is to convert the Feather.jl and Parquet.jl packages into “metadata translators” which convert their metadata to arrow and vice-versa, enabling read/write with arrow. I believe this is basically what they do in C++.

As for parquet, it’s a completely different format, as far as I know it did not originally have anything to do with arrow. Since arrow is fairly general (for tabular data at least) and simple, many parquets already happen to have data in arrow formats (as all columnar data is very similar). Therefore much of many parquet files can be considered to be “arrow buffers” even though arrow metadata from those must be generated from the parquet metadata.

Anyway, I’m definitely no authority on this, I have no priveledged insight into any of this, this is just what I’ve pieced together over the 2 years or so since I started working on overhauling Feather.jl.

xiaodai · September 16, 2019, 1:48pm

It’s rather confusing for me too. I know that Arrow is in-mem and Parquet is on disk. But what’s giving the impression that they getting tightly tangled is that pyarrow provides the read_parquet functionalties for pandas and R’s {arrow} has the only reliable read_parquet function in native R (i.e. not via slow Spark).

They sit together in both Python and R, so must be related somehow is my logic.

ExpandingMan · September 16, 2019, 2:00pm

As for Arrow itself, the conference post you shared was interesting, and I certainly share some of their apparent consternation about it. As for C++ idioms, I did not spend too much time looking at their code, and though I have a good deal of experience in C++, I was working in the high energy physics community and was somewhat in a bubble, so I’m not sure I’d recognize common C++ practices in the broader software community.

I also had some misgivings about the code being in a monorepo. I felt that, were we to have a Julia arrow package there, it would suddenly become much more difficult to work on because of the friction involved in working on such a large project. The authors seemed very insistent that a monorepo was a good idea even to the extent that they expressed interest on moving Julia code there even very long before it was in a state appropriate for that repo. The main reason they gave for wanting a monorepo was their claim that it made testing easier, which, to be honest, I find rather dubious.

I suspect that one of the reasons they want the monorepo is because they did not do an incredibly thorough job of documenting the standard, instead relying on all the code being in that monorepo where the arrow offers can make it compliant. I suspect this is why people expressed concerns about their not being a clear distinction between the specification and the implementation. A clear specification would require a much more detailed white paper, which they lack and the authors have not shown any interest in, as far as I have seen.

As for the format itself, after my relatively recent work implementing it, I have several concerns about it:

The format is extremely general and supports quite a few composite data types. I therefore don’t really understand why it was decided to keep the format so focused on tabular data. The format itself in principle supports all sorts of data structures, but the metadata seems very strongly focused on tabular data. A good place to see where this has had dubious consequences is that the tensor metadata seems entirely tacked-on. I suspect if they had realized where this was all heading earlier, they would have changed the metadata and wound up with a more generalized IPC data format, which I think still would have been extremely useful, because, perhaps surprisingly, the alternatives are lacking.
The metadata seems to have unnecessary inconsistencies and this apparently causes significantly more code to have to be written and maintained. For example, strings are basically a structure that they have called List<UInt8>, but rather than implementing it this way, there are several places where the metadata is different for strings. This makes it much harder than it should be to write generic code that works for many different data types.
It is very hard to do much of anything without individual arrow messages or batches, which is a little unfortunate as it makes data harder to construct. For example, if a batch simply had metadata with it saying that it is a particular binary data type (like int or float), the format would be far less dependent on the header, and rather than having to orchestrate the construction of an object accross a bunch of separate batches, one could build it as if by bricks. I realized they wanted to keep the metadata small because this is an IPC format, but I felt there were a few places where they could have made the metadata slightly bigger in the batches for a lot more convenience (along with smaller metadata in the header and possibly even less reliance on referencing the specification).

I suspect a lot of the things that seem odd about the format have something to do with pandas or Python, but I don’t really know. Also bear in mind that I know very little about what it takes to make a really good IPC format, so take what I say with a grain of salt. The arrow authors may have perfectly good reasons for much of what I’m complaining about here, in which case I’d be happy to retract my criticism.

Having said all this, there is already quite a lot of stuff that arrow has been implemented for, and it could be useful for Julia in all sorts of ways. For people to be willing to use Julia at all in the “big data” world it’s going to have to do things like interact efficiently with spark, which will require reading and writing arrow.

I have not given up on my overhaul of Arrow.jl, and I have made lots of significant progress (I can serialize and deserialize many messages), but I’ve repeatedly gotten derailed by work and personal stuff, so I can’t say exactly when I’ll return to it in earnest.

ExpandingMan · September 16, 2019, 2:02pm

Like I said, you can (and they do) create a parquet reader that emits arrow messages. That’s what’s going on there, take a look at the parquet file format. It’s different than arrow, it even uses a different serialization format for the metadata.

Topic		Replies	Views
What fileformat to use to load data for high performance computing Machine Learning	37	7139	December 1, 2018
Writing Parquet files General Usage	28	5361	November 12, 2020
Apache Arrow 1.0 release Data arrow	7	1950	September 5, 2020
The poor state of fileformats for High Performance computing General Usage	16	2681	August 13, 2017
[ANN] Arrow.jl 0.3 Release Data arrow	21	3253	March 16, 2021

Arrow, Feather, and Parquet

Related topics