I had seen those, but both looked crufty/neglected…really hoping something fresh and more friendly to non-MPI methods of distributed computing pans out. HDF5 has always been there, but the codebase is literally millions of lines and its maintained by a small number of developers (bless their souls), so updates to the standard come very slowly and very incrementally. Hard for it to keep up with changing landscape of data science needs (unfortunately!)
@wsphillips thanks for the mention of TileDB. This thread and some of the links were very interesting to read. @wesm makes the important point that existing columnar and scientific data formats were designed to solve distinct problems:
However, with TileDB we aim to bridge the gap between scientific and columnar formats with a single efficient, cloud-optimized array format. TileDB provides fast dense array support, solving various limitations of HDF5 (e.g., parallel writes). But TileDB can also model dataframes with its sparse array support, by selecting a subset of its columns to represent the dimensions (see this discussion). Our integrations with Spark, MariaDB and PrestoDB show how we can feed TileDB arrays into SQL-based query engines. We are currently working to round out TileDB’s dataframe capability with heterogenous dimension types, support for string dimensions, and predicate push-down, which will be released in the next few months.
We’ve recently discussed our vision to simplify data science through the integration of careful format design and efficient storage engine library, capable of updates, partitioning logic, and optimizations. For those not familiar with TileDB: in a nutshell, it consists of a fast, open-source, C++ library that is fully parallelized for dense and sparse array i/o without relying on separate libraries (Dask, Spark, etc.), and works particularly well on AWS S3. We aim to make this technology available as broadly as possible by continuing to build efficient high-level APIs (currently: C, C++, Python, R, Java, Go) and integrations (currently: Dask, Spark, PrestoDB, MariaDB, PDAL, GDAL). Regarding Arrow specifically, we recently used it in our VCF genomics library (see here), and based on that positive experience, we are planning on adding Arrow support to the core library as well.
Very cool. Also for those interested, I also had some extremely thorough and helpful feedback from the Zarr team about where they are going with the new version 3.0 spec and how Zarr relates to TileDB and other packages in this niche: https://github.com/zarr-developers/zarr-python/issues/515
@ihnorton Anything you want to add/clarify about the comments there?
Anyone have experience with streaming copy-free arrow in Julia?
I heard this is being done in python and in a few other languages, and was curious to try it myself.
I only read recently that it isnt possible yrt
That was a nice response, thanks for fostering this exchange.
The core architectural choices predate my involvement with TileDB by several years, but Stavros (TileDB’s author) just posted a response outlining how some of those technical decisions were driven by TileDB’s original motivating use-case:
- allowing sample additions to sparse arrays over time – for genomic variant calling specifically, the “N+1” problem – offering rapid updates
- allowing space, time, and i/o efficient queries on 10M+ ranges on a 100TB sparse array (this is a realistic dataset size for our largest genomics users)
- allowing versioning (“time traveling”) of un-consolidated datasets, which is important for auditability, database interaction, etc.
For anyone who wants to try out Zarr.jl with data stored in Google Cloud, here is a repo / binder with some examples: