I’d like to announce and request comments on Parquet3.jl. This implementation focuses specifically on supporting nested column types, which are explicitly out of scope for Parquet2.jl but are a natural representation for data typical of physics experiments, such as waveform data and the observables of multiple physics objects in an event.
Strategy
The key design decision is to use Arrow.jl as the user-facing interface, because the Arrow spec accommodates nested columns with APIs like ListArray.
Note that while Arrow.List is used for jagged arrays, Arrow.FixedSizeList is not used for fixed-size arrays. That type uses NTuple{N,T} as its element type, so every index access copies all N elements to build an N-tuple, which is unacceptable for waveform data where N can be on the order of 10³–10⁴.
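To see why tuple-based element access hurts at that scale, here is a minimal sketch. The `flat` buffer and the two accessor functions below are illustrative stand-ins, not Arrow.jl internals:

```julia
N = 1_000                      # samples per waveform
flat = rand(Float32, N * 100)  # 100 waveforms stored back-to-back

# NTuple-style access: every lookup copies all N samples into a fresh tuple.
get_tuple(i) = ntuple(j -> flat[(i - 1) * N + j], N)

# View-style access: constant-time, no copy of the underlying samples.
get_view(i) = view(flat, (i - 1) * N + 1 : i * N)

@assert collect(get_view(3)) == collect(get_tuple(3))
```

The two accessors return the same samples, but the tuple version does O(N) work per access while the view version does O(1).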
This package uses two structs, FixedSizeListVector and FixedSizeView, to circumvent the issue: data are stored in instances of the first type, and views into them are returned whenever users access elements. I’d argue that this should be the default behavior of Arrow.FixedSizeList and would love to hear @quinnj’s thoughts on this.
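For concreteness, the idea can be sketched like this. The field names and the use of a plain `Base` view in place of a dedicated FixedSizeView type are my illustrative assumptions, not Parquet3.jl's actual definitions:

```julia
# Hedged sketch of a fixed-size-list vector backed by one flat buffer.
struct FixedSizeListVector{T} <: AbstractVector{AbstractVector{T}}
    data::Vector{T}   # all lists stored back-to-back
    len::Int          # fixed length N of each list
end

Base.size(v::FixedSizeListVector) = (length(v.data) ÷ v.len,)

# Indexing returns a zero-copy view into the flat buffer instead of an NTuple.
Base.getindex(v::FixedSizeListVector, i::Int) =
    view(v.data, (i - 1) * v.len + 1 : i * v.len)

waves = FixedSizeListVector(collect(Float32, 1:12), 4)
@assert length(waves) == 3
@assert waves[2] == Float32[5, 6, 7, 8]
```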
Why this isn’t a PR to Parquet2.jl
Using Arrow.jl as the interface is also why, in my opinion, this implementation would be too breaking to submit as a PR to Parquet2.jl: that package provides its own user interface.
Priority for Development
At the moment, support and performance for reading nested columns have the highest priority. So far, optimization has focused on reading dense (i.e. non-nullable) fixed-size lists, as that is how waveform data are stored. Support for writing to on-disk files will follow. Full support for all logical types, compressions, encodings, and the nested types MAP and STRUCT is of lower priority, in contrast to Parquet2.jl.
Yeah, having something like FixedSizeView sounds like a great idea for Arrow.jl. I think there are a couple of places in the various array types where we materialize on indexing a little too eagerly.
Overall excited to see more progress here for parquet coverage!
I found that Arrow.jl will write to a file without the file size having to be specified before the write starts. If you need to write a file of indeterminate size, this is extremely handy. Unfortunately, it takes away the ability to have channel-specific metadata, as mentioned here. In my use case the data were a matrix from a data acquisition system, where each channel is sampled at the same rate.
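If I understand the mechanism, the size-agnostic write works by streaming record batches, e.g. via `Tables.partitioner`, which Arrow.jl writes one batch at a time. A rough sketch, with illustrative column names and file path (and assuming my reading of the `Arrow.write` + `Tables.partitioner` combination is right):

```julia
using Arrow, Tables

# An iterator of tables; the total number of batches need not be known up front.
batches = ((ch1 = rand(Float32, 256), ch2 = rand(Float32, 256)) for _ in 1:4)

# Each partition becomes one record batch in the output file.
Arrow.write("acquisition.arrow", Tables.partitioner(batches))
```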
Yeah, Arrow.jl should provide views into fixed-size arrays. I’d modify the behavior of FixedSizeList instead of adding separate mechanisms, because IMO Arrow.jl should materialize as little as possible, given that it’s designed as a spec for data layout. But of course a breaking change like this should be discussed more thoroughly.
That question is of particular interest to the group I work in right now. Some context: we work in a collaboration that employs an HDF5-based data format, and both the Python and Julia implementations of the format are created and maintained in-house.
So far, we have found that switching to Parquet would save the collaboration a lot of maintenance effort on the HDF5-based implementations, because much less custom code would be needed. Parquet also has a better compression ratio and faster loading for our waveform data, provided that the right encoding is used.
Another benefit of Parquet is its compatibility with modern tools that are used and battle-tested in industry, such as Polars, DuckDB, and Spark. Our data are partitioned into tiers and channels, and selecting data by aggregating information across all tiers, then parallelizing work across channels, has been a challenge for newcomers to the collaboration: they have to write custom Python scripts to process HDF5 files, whereas Parquet files can be processed with the more robust APIs these tools provide.
That said, if your data are huge matrices that cannot naturally be described with a relational model, HDF5 might be a good fit. But if your data are like ours, relational and occasionally dense yet still able to fit in a Parquet file with the right encoding, then Parquet will be suitable.