Arrow, Feather, and Parquet

wsphillips · December 2, 2019, 10:01pm

I had seen those, but both looked crufty/neglected… really hoping something fresh and more friendly to non-MPI methods of distributed computing pans out. HDF5 has always been there, but the codebase is literally millions of lines and it's maintained by a small number of developers (bless their souls), so updates to the standard come very slowly and very incrementally. Hard for it to keep up with the changing landscape of data science needs (unfortunately!)

ihnorton · December 9, 2019, 7:48pm

@wsphillips thanks for the mention of TileDB. This thread and some of the links were very interesting to read. @wesm makes the important point that existing columnar and scientific data formats were designed to solve distinct problems.

However, with TileDB we aim to bridge the gap between scientific and columnar formats with a single efficient, cloud-optimized array format. TileDB provides fast dense array support, solving various limitations of HDF5 (e.g., parallel writes). But TileDB can also model dataframes with its sparse array support, by selecting a subset of the dataframe's columns to represent the dimensions (see this discussion). Our integrations with Spark, MariaDB and PrestoDB show how we can feed TileDB arrays into SQL-based query engines. We are currently working to round out TileDB’s dataframe capability with heterogeneous dimension types, support for string dimensions, and predicate push-down, which will be released in the next few months.
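To make the "dataframe as a sparse array" idea concrete, here is a toy sketch in plain Python. It is not the TileDB API: `to_sparse`, `slice_cells`, and the example columns are hypothetical names chosen for illustration. It only shows the modelling step, where a subset of columns becomes the coordinates and the remaining columns become per-cell attributes, assuming each coordinate appears at most once.

```python
# Toy illustration only; none of these names come from TileDB.

def to_sparse(rows, dims):
    """Index rows by the chosen dimension columns; other columns become attributes."""
    cells = {}
    for row in rows:
        coord = tuple(row[d] for d in dims)
        attrs = {k: v for k, v in row.items() if k not in dims}
        cells[coord] = attrs  # assumes each coordinate occurs once
    return cells

def slice_cells(cells, ranges):
    """Return cells whose coordinates fall inside inclusive per-dimension ranges."""
    return {
        coord: attrs
        for coord, attrs in cells.items()
        if all(lo <= c <= hi for c, (lo, hi) in zip(coord, ranges))
    }

# A small "dataframe" as a list of row dictionaries (made-up values).
rows = [
    {"sample": 1, "position": 100, "allele": "A"},
    {"sample": 1, "position": 250, "allele": "T"},
    {"sample": 2, "position": 120, "allele": "G"},
]

# Use two of the three columns as dimensions.
cells = to_sparse(rows, dims=("sample", "position"))

# Range query: sample 1 only, positions 0 through 200.
hits = slice_cells(cells, [(1, 1), (0, 200)])
print(hits)
```

A real engine would tile and index the coordinates rather than scan a dictionary; the sketch only shows why choosing dimension columns turns filters on those columns into range slices.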

We’ve recently discussed our vision to simplify data science through the integration of careful format design and an efficient storage engine library, capable of updates, partitioning logic, and optimizations. For those not familiar with TileDB: in a nutshell, it consists of a fast, open-source C++ library that is fully parallelized for dense and sparse array i/o without relying on separate libraries (Dask, Spark, etc.), and works particularly well on AWS S3. We aim to make this technology available as broadly as possible by continuing to build efficient high-level APIs (currently: C, C++, Python, R, Java, Go) and integrations (currently: Dask, Spark, PrestoDB, MariaDB, PDAL, GDAL). Regarding Arrow specifically, we recently used it in our VCF genomics library (see here), and based on that positive experience, we are planning on adding Arrow support to the core library as well.

wsphillips · December 9, 2019, 8:00pm

Very cool. For those interested, I also got some extremely thorough and helpful feedback from the Zarr team about where they are going with the new version 3.0 spec and how Zarr relates to TileDB and other packages in this niche: Question/possible enhancement: Relationship to N5 + Arrow/Parquet? · Issue #515 · zarr-developers/zarr-python · GitHub

@ihnorton Anything you want to add/clarify about the comments there?

anon92994695 · December 9, 2019, 9:43pm

Does anyone have experience with streaming copy-free Arrow in Julia?

I heard this is being done in Python and in a few other languages, and was curious to try it myself.

xiaodai · December 9, 2019, 10:16pm

I only read recently that it isn't possible yet.

ihnorton · December 10, 2019, 5:49pm

That was a nice response, thanks for fostering this exchange.

The core architectural choices predate my involvement with TileDB by several years, but Stavros (TileDB’s author) just posted a response outlining how some of those technical decisions were driven by TileDB’s original motivating use-case:

allowing sample additions to sparse arrays over time – for genomic variant calling specifically, the “N+1” problem – offering rapid updates

allowing space, time, and i/o efficient queries on 10M+ ranges on a 100TB sparse array (this is a realistic dataset size for our largest genomics users)

allowing versioning (“time traveling”) of un-consolidated datasets, which is important for auditability, database interaction, etc.
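As a rough illustration of how those three goals fit together, here is a toy sketch in plain Python. It is not TileDB's implementation or API: `FragmentedArray`, `write`, and `read` are hypothetical names, and the data is made up. The assumption is simply that every write is kept as its own timestamped, immutable fragment, and that a read merges fragments up to a chosen time.

```python
# Toy illustration only; this is not how TileDB is implemented.

class FragmentedArray:
    """Sparse array stored as append-only, timestamped fragments."""

    def __init__(self):
        self.fragments = []  # list of (timestamp, {coord: value})

    def write(self, timestamp, cells):
        # Each write adds a new fragment; earlier fragments are never modified.
        self.fragments.append((timestamp, dict(cells)))

    def read(self, as_of=None):
        # Merge fragments in time order; later writes override earlier ones.
        view = {}
        for ts, cells in sorted(self.fragments, key=lambda f: f[0]):
            if as_of is not None and ts > as_of:
                break
            view.update(cells)
        return view


arr = FragmentedArray()
arr.write(1, {("s1", 100): "A"})  # initial sample
arr.write(2, {("s2", 100): "G"})  # the "N+1" sample, added without rewriting s1
arr.write(3, {("s1", 100): "T"})  # a later update to an existing cell

print(arr.read(as_of=2))  # the array as it looked after the second write
print(arr.read())         # the latest state
```

In this sketch, adding a sample costs one new fragment regardless of how much data already exists, and "time traveling" is just a read with an earlier `as_of`. Consolidation, which would merge fragments into one, is left out.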

rabernat · December 18, 2019, 8:18pm

For anyone who wants to try out Zarr.jl with data stored in Google Cloud, here is a repo / binder with some examples:

https://github.com/pangeo-data/pangeo-julia-examples/

ToucheSir · September 20, 2020, 10:44pm

Would the new Arrow C data and stream interfaces help with ease of implementation?

mks · November 1, 2020, 11:02am

One interesting in-memory shared object store provided by Apache Arrow is Plasma.

I couldn’t yet find a Julia implementation, so I thought I’d mention it here to raise some evaluation interest.

As an example in Python, there is this project, which adds a namespace to the Plasma store: https://github.com/russellromney/brain-plasma
