Arrow, Feather, and Parquet

It reminds me of this talk

IIRC it was about particle physics people using Arrow as a building block for non-tabular data structures.

1 Like

Wow, very interesting, this talk bridges the phases of my adult life lol.

I find it really disheartening how Python seems to be working its way ever more deeply into the HEP world. Anyway, this definitely gives me a bit more motivation to work on Arrow, as I would love to make it easier for high-energy physicists to use Julia.

2 Likes

My workshop from VLDB might be helpful in clearing up some of the confusion about how Arrow’s binary protocol works.

(See slide deck within)

On some of the other questions it might be worth a direct conversation with the Arrow community on dev@arrow.apache.org. We would prefer to work directly with the Julia community and for Julia to become a first class citizen in the Apache Arrow world.

EDITS (since I can’t reply more)

Some concerns raised here suggest that it’s difficult to contribute to the Apache project, and that somehow having a monorepo is part of the problem. There’s no evidence at all that this is the case. The project is adding 10 or more new contributors every month; here is a graph of cumulative unique contributors over time:

[graph: cumulative unique contributors over time]

The monorepo structure of the project has actually been critical to the project’s success from a developer productivity and testing perspective. We have tight-knit collaborations between different subgroups of developers. You might not expect a C++-focused person to help a Go, Java, or JavaScript developer, but that’s the kind of community we have created.

There are some other criticisms being made here that I don’t agree with. I’d rather take those up in a more constructive fashion on the Arrow developer mailing list.

17 Likes

Hi, I would like to have an opportunity to address your complaints and clear up your confusion. Can you please write to dev@arrow.apache.org? The slide deck I posted above should also help, as it gives a deep dive into the columnar format. I have also recently cleaned up the specification to be clearer for implementation authors.

6 Likes

So I shouldn’t be using “.feather” anymore?

You can certainly keep using Feather files. The internal details of Feather files are going to change in the next 12 months to take advantage of the work we have done in the Arrow project, but we will maintain compatibility code for some period of time to read old files.

4 Likes

Welcome @wesm

1 Like

It should be possible to create a subdirectory of the monorepo that is a proper, registered Julia package, so I don’t think there’s any technical reason why this can’t be done.

6 Likes

Disclaimer: Maybe naive comment/question.

While the benefit of a common storage format such as Arrow is obvious for tabular data like DataFrames, would it also be useful for storing generic Julia data structures like n-dimensional arrays (thinking images, simulation data, sensor data, etc.)? Or is HDF5 plus derivatives (e.g. JLD2) more or less the way forward for that type of data?

We have some facilities for serializing and memory-mapping ndarrays using common memory abstractions and metadata serialization in the Arrow project, but generic storage of scientific ndarray data hasn’t been a central focus of the community relative to the tabular data problem.

Edit: for those interested there’s some more comments here https://github.com/apache/arrow/issues/4802
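To make that a bit more concrete, here is a minimal sketch of those tensor facilities using pyarrow (illustrative only; it assumes the tensor read/write functions in the `pyarrow.ipc` module, and the file name is made up):

```python
import numpy as np
import pyarrow as pa

# Wrap a NumPy ndarray as an Arrow Tensor (zero-copy where possible)
arr = np.arange(12, dtype=np.float64).reshape(3, 4)
tensor = pa.Tensor.from_numpy(arr)

# Serialize the tensor to a file using Arrow's IPC facilities
with pa.OSFile("tensor.arrow", "wb") as sink:
    pa.ipc.write_tensor(tensor, sink)

# Memory-map the file and reconstruct the tensor without copying the data
with pa.memory_map("tensor.arrow", "r") as source:
    loaded = pa.ipc.read_tensor(source)
    print(loaded.to_numpy())  # same 3x4 array, backed by the mapped file
```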

Hm. So reading the documentation and your comment:

I can see there’s tooling for working with tensors (ndarrays), and I can see the serialization. So what’s missing is a tensor-compatible interface to Parquet? For instance, there are Python and Rust examples moving Arrow table structures to a Parquet file, but I am guessing that’s not so straightforward with a tensor?

In any case, I really like the idea of an efficient common in-memory (and, via Parquet, on-disk) data format with huge community support. I am excited to see where it goes.
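To be concrete, the table-to-Parquet path I mean looks roughly like this in Python (a sketch assuming pyarrow; the column names are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table in memory and write it to a Parquet file
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
pq.write_table(table, "example.parquet")

# Read it back as an Arrow table
roundtrip = pq.read_table("example.parquet")
assert roundtrip.equals(table)
```

A tensor has no direct column type on this path, which is the gap I’m asking about.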

1 Like

Parquet doesn’t have a tensor/ndarray value type, but you could embed tensor data in a BYTE_ARRAY value if you wanted. The format is not designed for general storage of numerical datasets; for that you would be better off using HDF5. Parquet is designed for analytics / SQL-style query processing.
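A rough sketch of that embedding with pyarrow (the column and file names are made up; Arrow’s binary type maps to Parquet’s BYTE_ARRAY):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Flatten each tensor to raw bytes in a binary (BYTE_ARRAY) column,
# keeping shape and dtype alongside so the arrays can be rebuilt
tensors = [np.random.rand(3, 4), np.random.rand(2, 5)]
table = pa.table({
    "data":  pa.array([t.tobytes() for t in tensors], type=pa.binary()),
    "shape": pa.array([list(t.shape) for t in tensors]),
    "dtype": pa.array([str(t.dtype) for t in tensors]),
})
pq.write_table(table, "tensors.parquet")

# Reconstruct the first tensor from its bytes, dtype, and shape
t = pq.read_table("tensors.parquet")
restored = np.frombuffer(
    t["data"][0].as_py(), dtype=t["dtype"][0].as_py()
).reshape(t["shape"][0].as_py())
```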

@wesm, could you comment a bit on Feather vs. the Arrow on-disk serialization format?

It had seemed to me that Feather was developed very early in Arrow’s development, and I suspected this had something to do with why its metadata is completely different from Arrow’s IPC metadata format. It seemed far more sensible to simply abandon Feather and start using the Arrow on-disk file format, and I was a little puzzled why you kept such a low profile and didn’t advertise that format more. To be honest, I never seriously intended to use Feather again after the Julia Arrow readers and writers were finished (though I had intended to continue maintaining Feather.jl).

Also, I’d like to take the opportunity to thank you and the Arrow community for all your work pushing the Arrow format and making it practical. Getting interoperability between all these things is no small feat. Any criticism I might have had for the format above comes with the benefit of hindsight, and of course I also recognize that in some places it may be more a reflection of my ignorance than of the format’s design. Having Arrow available is infinitely preferable to whatever would have been required before.

3 Likes

Can someone explain what the difference is between Arrow/Feather/Parquet and HDF5? It seems the latter is pretty powerful and has nice performance, except for parallel writes.

@sairus7 comparing HDF5 with a columnar file format is truly apples and oranges. HDF5 does not provide a columnar data model for datasets, nor metadata describing schemas for them. I’ve used HDF5 for more than 10 years now, and it serves as “a place to put raw bytes or a collection of ndarrays/tensors”. The memory model of data stored in the Arrow format (in-memory or on disk) or in the Parquet format on disk is significantly different.

To make the distinction between HDF5 and Parquet clear: you can feed Parquet directly into a SQL-based query engine without any special logic, but no such thing is possible with HDF5 without layering some kind of opinionated “semantic layer” on top of it to provide a columnar interpretation of some collection of arrays stored inside.
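To illustrate, here is a sketch using pyarrow’s dataset layer as one example of a Parquet-aware scanner (the file and column names are made up):

```python
import pyarrow.dataset as ds

# Parquet carries a columnar schema, so a query engine can push down
# column selection and row filtering with no extra semantic layer
dataset = ds.dataset("trades.parquet", format="parquet")
result = dataset.to_table(
    columns=["symbol", "price"],       # column pruning
    filter=ds.field("price") > 100.0,  # predicate pushdown
)
```

With HDF5 you would first have to invent your own convention for which datasets are the columns and what their types mean.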

@ExpandingMan I wrote about Feather’s history and trajectory in Wes McKinney - Feather format update: Whence and Whither?. The plan discussed there is indeed to deprecate the “feather.fbs” file and have Feather become simply an alias for the Arrow IPC file format.

I waited patiently for the R community to build bindings for the Arrow C++ library and get them on CRAN, but that did not happen until August 2019, nearly 3.5 years after the initial release of Feather. Until that happened, it wasn’t possible for me to modify the format. We have no plans to do any further development in GitHub - wesm/feather: Feather: fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow.
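Once that plan lands, a round trip like the following should hold (a pyarrow sketch of the “Feather as Arrow IPC alias” idea, written uncompressed for simplicity; not a commitment to any particular API):

```python
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"a": [1, 2, 3]})

# A Feather v2 file *is* an Arrow IPC file...
feather.write_feather(table, "data.feather", compression="uncompressed")

# ...so the generic Arrow IPC reader can open it directly
with pa.memory_map("data.feather", "r") as source:
    same = pa.ipc.open_file(source).read_all()
assert same.equals(table)
```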

I was a little puzzled why you guys kept such a low profile and didn’t advertise that format more

This seems like a pretty subjective judgment. Perhaps we did not “market” the binary protocol in the way that you’re suggesting; we’re not trying to displace other storage formats, for example. We’ve created a lot of technical content, blog posts, slide decks, etc. illustrating the performance and interoperability benefits of using the Arrow format. It is being used in numerous downstream open source applications and many more proprietary ones. On the basis of implementation maturity and downstream adoption, it would seem that we’ve reached many of our intended audiences.

At the end of the day, we are an open source community, and we do not have any commercial entities profiting directly from adoption of Apache Arrow. I would rather have Julia developers be part of our community and work together on these problems, including the technical evangelism.

16 Likes

Thanks for clarifying that, it’s really interesting!

Some time ago we moved our structured storage from a directory tree with file streams to HDF5, just because HDF5 compiles for different architectures, supports streaming data with chunking and compression, and is easy to use from any environment.

However, we did indeed write an additional layer for reading and writing data arrays and handling table-like indexing (mainly for time-series arrays from sensors and analytics results over them).

I have 2 questions then:

  1. It seems like we can use all of this functionality in Arrow as well, but with an additional SQL-like layer?
  2. Is it possible to “wrap” some data providers with Arrow, so that they simulate data columns that are not stored in the file but are calculated and cached on the fly as requested (from another file or from other data columns)?

Parquet doesn’t have a tensor/ndarray value type, but you could embed tensor data in a BYTE_ARRAY value if you wanted. The format is not designed for general storage of numerical datasets; for that you would be better off using HDF5. Parquet is designed for analytics / SQL-style query processing.

I recently came across AwkwardArray (GitHub - scikit-hep/awkward-0.x: Manipulate arrays of complex data structures as easily as Numpy.), a Python library (with C++ planned for the future) whose purpose is to “manipulate arrays of complex data structures as easily as Numpy”. Unlike numpy, which only supports rectangular arrays, it supports ragged arrays whose rows have different lengths.

Since it has support for Parquet and Apache Arrow, it seems that it is possible to store ndarrays in Parquet after all.
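The mechanism underneath is the nested list type that both Arrow and Parquet support, e.g. in this pyarrow sketch (the column name is made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Rows of different lengths become a single list<double> column
ragged = pa.array([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]])
pq.write_table(pa.table({"hits": ragged}), "ragged.parquet")

back = pq.read_table("ragged.parquet")
print(back["hits"][1].as_py())  # [4.0]
```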

The most promising format I have seen so far is probably TileDB, with an honorable mention for the future prospects of N5/Zarr (the two projects are merging).

I think TileDB storage with (optionally) Arrow in-memory objects could do some nifty things 🙂

4 Likes

I was aware of N5/Zarr from a blog post about the Pangeo project on the Dask blog, but I didn’t know about TileDB. That seems pretty sweet.

For older array db systems check out http://rasdaman.com/ & https://www.paradigm4.com/try_scidb/

1 Like

For people interested in TileDB, there is a vote for Julia support:

4 Likes