Arrow, Feather, and Parquet

wesm · September 26, 2019, 10:09pm

@sairus7 comparing HDF5 with a columnar file format is truly apples and oranges. HDF5 provides neither a columnar data model for datasets nor metadata describing their schemas. I’ve used HDF5 for more than 10 years now, and it serves as “a place to put raw bytes or a collection of ndarrays/tensors”. The memory model of data that is stored in Arrow format (in-memory or on disk) or Parquet format on disk is significantly different.

To make the distinction between HDF5 and Parquet clear: you can feed Parquet directly into a SQL-based query engine without any special logic, but no such thing is possible with HDF5 unless you layer some kind of opinionated “semantic layer” on top of it to provide a columnar interpretation of some collection of the arrays stored inside.
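As a rough illustration of what such a “semantic layer” amounts to, here is a toy sketch using only the Python standard library. It does not use HDF5 or Parquet at all: `raw_arrays`, `ColumnarTable`, `load_into_sql`, and `semantic_layer` are hypothetical names invented for this example, a plain dict stands in for an HDF5-style container, and `sqlite3` stands in for a SQL engine.

```python
import sqlite3
from dataclasses import dataclass

# Stand-in for an HDF5-style container: named arrays and nothing else.
# Nothing here says which arrays form a table or what the columns mean.
raw_arrays = {
    "/group/a": [1, 2, 3],
    "/group/b": [0.5, 1.5, 2.5],
    "/other/c": [10, 20],  # different length; is it part of the table?
}

# Stand-in for a columnar file: the schema travels with the data.
@dataclass
class ColumnarTable:
    schema: list    # list of (column name, SQL type) pairs
    columns: dict   # column name -> list of values

def load_into_sql(conn, name, table):
    """Generic loader: relies only on what the table declares about itself."""
    names = [col for col, _ in table.schema]
    decl = ", ".join(f"{col} {sql_type}" for col, sql_type in table.schema)
    conn.execute(f"CREATE TABLE {name} ({decl})")
    rows = zip(*(table.columns[col] for col in names))
    marks = ", ".join("?" for _ in names)
    conn.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)

def semantic_layer(arrays):
    """The opinionated part: a hand-written decision about which arrays
    are columns, what they are called, and what their types are."""
    return ColumnarTable(
        schema=[("id", "INTEGER"), ("value", "REAL")],
        columns={"id": arrays["/group/a"], "value": arrays["/group/b"]},
    )

conn = sqlite3.connect(":memory:")
load_into_sql(conn, "t", semantic_layer(raw_arrays))
total = conn.execute("SELECT SUM(value) FROM t WHERE id >= 2").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

The generic loader never needs to know anything about the particular dataset; all dataset-specific knowledge lives in `semantic_layer`, which is exactly the piece a self-describing columnar format makes unnecessary.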

@ExpandingMan I wrote about Feather’s history and trajectory in my post “Feather format update: Whence and Whither?”. The plan discussed there is indeed to deprecate the “feather.fbs” file and have Feather be simply an alias for the Arrow IPC file format.

I waited patiently for the R community to build bindings for the Arrow C++ library and get them on CRAN, but that did not happen until August 2019, nearly 3.5 years after the initial release of Feather. Until that happened, it wasn’t possible for me to modify the format. We have no plans to do any further development in the wesm/feather repository on GitHub.

“I was a little puzzled why you guys kept such a low profile and didn’t advertise that format more”

This seems like a pretty subjective judgment. Perhaps we did not “market” the binary protocol in the way that you’re suggesting. We’re not trying to displace other storage formats, for example. We’ve created a lot of technical content, blog posts, slide decks, etc. illustrating the performance and interoperability benefits of using the Arrow format. It is being used in numerous downstream open source applications and many more proprietary applications. On the basis of implementation maturity and downstream adoption, it would seem that we’ve reached many of our intended audiences.

At the end of the day, we are an open source community and we do not have any commercial entities profiting directly from adoption of Apache Arrow. I would rather have Julia developers be part of our community and work together with us on these problems, including the technical evangelism.
