I’d like to announce and request comments on Parquet3.jl. This implementation focuses specifically on support for nested column types, which are explicitly out of scope for Parquet2.jl but are a natural representation for data typical of physics experiments, such as waveform data and observables of multiple physics objects in an event.
Strategy
The key design decision is to use Arrow.jl as the user-facing interface because
- the Arrow spec accommodates nested columns with APIs like ListArray, and
- encoding/decoding of nested data between Parquet and Arrow has been implemented in other languages.
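To make "nested column" concrete, here is a minimal sketch in plain Julia of the layout the Arrow spec uses for variable-size lists: one flat buffer of child values plus an offsets buffer. It does not use Arrow.jl, and `nthlist` is a hypothetical helper made up for illustration, not a function from any package.

```julia
# Arrow-style variable-size list layout: flat child values + offsets.
# Offsets are 0-based as in the Arrow spec, so list i covers
# child[offsets[i]+1 : offsets[i+1]] in Julia's 1-based indexing.
child   = Float32[1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
offsets = Int32[0, 2, 2, 6]   # three lists with lengths 2, 0 and 4

# Hypothetical helper: a no-copy view of the i-th list.
nthlist(child, offsets, i) = view(child, (offsets[i] + 1):offsets[i + 1])

lists = [nthlist(child, offsets, i) for i in 1:length(offsets) - 1]
```

Each element of `lists` is a view into `child`, so a jagged column costs one contiguous buffer regardless of how many rows it has.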
FixedSizeList workaround
Note that while Arrow.List is used for jagged arrays, Arrow.FixedSizeList is not used for fixed-size arrays. That type uses NTuple{N,T} as its element type, so every index access copies all N elements to create an N-tuple, which is unacceptable for waveform data where N can be of $\mathcal{O}(10^3\mathord{-}10^4)$.
This package uses two structs, FixedSizeListVector and FixedSizeView, to circumvent the issue: the data are stored in an instance of the first type, and indexing into it returns a view of the underlying buffer instead of a copy. I’d argue that this should be the default behavior of Arrow.FixedSizeList and would love to hear @quinnj’s thoughts on this.
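A rough sketch of how such a pair of types could look is below. Only the two struct names come from the package; the fields, constructors and methods are my assumptions for illustration and are not taken from the Parquet3.jl source.

```julia
# Hypothetical sketch: field names and layout are assumptions.

# A no-copy window onto one fixed-size list inside the flat buffer.
struct FixedSizeView{T} <: AbstractVector{T}
    data::Vector{T}
    offset::Int   # number of elements that precede this list in `data`
    len::Int      # N, the fixed list size
end

Base.size(v::FixedSizeView) = (v.len,)
Base.@propagate_inbounds function Base.getindex(v::FixedSizeView, i::Int)
    @boundscheck checkbounds(v, i)
    return v.data[v.offset + i]
end

# A vector of fixed-size lists stored back to back in one flat buffer.
struct FixedSizeListVector{T} <: AbstractVector{FixedSizeView{T}}
    data::Vector{T}
    len::Int
    function FixedSizeListVector(data::Vector{T}, len::Integer) where {T}
        len > 0 || throw(ArgumentError("list size must be positive"))
        length(data) % len == 0 ||
            throw(ArgumentError("buffer length must be a multiple of the list size"))
        return new{T}(data, Int(len))
    end
end

Base.size(v::FixedSizeListVector) = (length(v.data) ÷ v.len,)
Base.@propagate_inbounds function Base.getindex(v::FixedSizeListVector, i::Int)
    @boundscheck checkbounds(v, i)
    # Only the buffer reference and two integers are stored; no elements are copied.
    return FixedSizeView(v.data, (i - 1) * v.len, v.len)
end

# Usage: three "waveforms" of four samples each, sharing one buffer.
buf = collect(1.0:12.0)
waveforms = FixedSizeListVector(buf, 4)
second = waveforms[2]   # a FixedSizeView over buf[5:8]
```

In this sketch the cost of `waveforms[i]` does not depend on N, whereas building an NTuple{N,T} touches all N elements on every access.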
Why this isn’t a PR to Parquet2.jl
Using Arrow.jl as the interface is also why IMO this implementation would be too much of a breaking change if it were a PR to Parquet2.jl, as that package provides its own user interface.
Priority for Development
At this moment, support and performance for reading nested columns are of the highest priority. So far, optimization has been done for reading dense (i.e. non-nullable) fixed-size lists as that is how waveform data are stored. Support for writing to on-disk files will follow. Full support for all logical types, compression, encodings and nested types MAP and STRUCT is of lower priority, in contrast to Parquet2.jl.