I’d like to announce and request comments on Parquet3.jl. This implementation focuses specifically on supporting nested column types, which are explicitly out of scope for Parquet2.jl but are a natural representation for data typical of physics experiments, such as waveform data and the observables of multiple physics objects in an event.
Strategy
The key design decision is to use Arrow.jl as the user-facing interface, because the Arrow spec accommodates nested columns with APIs like ListArray.
Note that while Arrow.List is used for jagged arrays, Arrow.FixedSizeList is not used for fixed-size arrays. That type uses NTuple{N,T} as its element type, so every index access copies all N elements to build an N-tuple, which is unacceptable for waveform data where N can be on the order of 10³–10⁴.
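To see why tuple-based element access hurts at that scale, here is a minimal sketch. The `flat` buffer and the two accessor functions below are illustrative stand-ins, not Arrow.jl internals:

```julia
N = 1_000                      # samples per waveform
flat = rand(Float32, N * 100)  # 100 waveforms stored back-to-back

# NTuple-style access: every lookup copies all N samples into a fresh tuple.
get_tuple(i) = ntuple(j -> flat[(i - 1) * N + j], N)

# View-style access: constant-time, no copy of the underlying samples.
get_view(i) = view(flat, (i - 1) * N + 1 : i * N)

@assert collect(get_view(3)) == collect(get_tuple(3))
```

The two accessors return the same samples, but the tuple version does O(N) work per access while the view version does O(1).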
This package uses two structs, FixedSizeListVector and FixedSizeView, to circumvent the issue: data are stored in instances of the first type, and views into them are returned whenever users access elements. I’d argue that this should be the default behavior of Arrow.FixedSizeList and would love to hear @quinnj’s thoughts on this.
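For concreteness, the idea can be sketched like this. The field names and the use of a plain `Base` view in place of a dedicated FixedSizeView type are my illustrative assumptions, not Parquet3.jl's actual definitions:

```julia
# Hedged sketch of a fixed-size-list vector backed by one flat buffer.
struct FixedSizeListVector{T} <: AbstractVector{AbstractVector{T}}
    data::Vector{T}   # all lists stored back-to-back
    len::Int          # fixed length N of each list
end

Base.size(v::FixedSizeListVector) = (length(v.data) ÷ v.len,)

# Indexing returns a zero-copy view into the flat buffer instead of an NTuple.
Base.getindex(v::FixedSizeListVector, i::Int) =
    view(v.data, (i - 1) * v.len + 1 : i * v.len)

waves = FixedSizeListVector(collect(Float32, 1:12), 4)
@assert length(waves) == 3
@assert waves[2] == Float32[5, 6, 7, 8]
```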
Why this isn’t a PR to Parquet2.jl
Using Arrow.jl as the interface is also why, in my opinion, this implementation would be too breaking to submit as a PR to Parquet2.jl: that package provides its own user interface.
Priority for Development
At the moment, support and performance for reading nested columns have the highest priority. So far, optimization has focused on reading dense (i.e. non-nullable) fixed-size lists, as that is how waveform data are stored. Support for writing to on-disk files will follow. Full support for all logical types, compressions, encodings, and the nested types MAP and STRUCT is of lower priority, in contrast to Parquet2.jl.
Yeah, having something like FixedSizeView sounds like a great idea for Arrow.jl. I think there are a couple of places in the various array types where we materialize on indexing a little too eagerly.
Overall excited to see more progress here for parquet coverage!
I found that Arrow.jl will write to a file without the file size having to be specified before the write starts. If you need to write a file of indeterminate size, this is extremely handy. Unfortunately, it takes away the ability to have channel-specific metadata, as mentioned here. In my use case the data were a matrix from a data acquisition system, where each channel is sampled at the same rate.
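If I understand the mechanism, the size-agnostic write works by streaming record batches, e.g. via `Tables.partitioner`, which Arrow.jl writes one batch at a time. A rough sketch, with illustrative column names and file path (and assuming my reading of the `Arrow.write` + `Tables.partitioner` combination is right):

```julia
using Arrow, Tables

# An iterator of tables; the total number of batches need not be known up front.
batches = ((ch1 = rand(Float32, 256), ch2 = rand(Float32, 256)) for _ in 1:4)

# Each partition becomes one record batch in the output file.
Arrow.write("acquisition.arrow", Tables.partitioner(batches))
```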
Yeah, Arrow.jl should provide views into fixed-size arrays. I’d modify the behavior of FixedSizeList instead of adding separate mechanisms, because IMO Arrow.jl should materialize as little as possible, given that it’s designed as a spec for data layout. But of course a breaking change like this should be discussed more thoroughly.
That question is of particular interest to the group I work in right now. Some context: we work in a collaboration that employs an HDF5-based data format, and both the Python and Julia implementations of the format are created and maintained in-house.
So far, we have found that switching to Parquet would save the collaboration a lot of maintenance effort on the HDF5-based implementations, because much less custom code would be needed. Parquet also has a better compression ratio and faster loading for our waveform data, provided that the right encoding is used.
Another benefit of Parquet is its compatibility with modern tools that are used and battle-tested in industry, such as Polars, DuckDB, and Spark. Our data are partitioned into tiers and channels, and selecting data by aggregating information across all tiers, then parallelizing work across channels, has been a challenge for newcomers to the collaboration: they have to write custom Python scripts to process HDF5 files, whereas Parquet files can be processed with the more robust APIs these tools provide.
That said, if your data are huge matrices that cannot naturally be described with a relational model, HDF5 might be a good fit. But if your data are like ours, relational and occasionally dense yet still able to fit in a Parquet file with the right encoding, then Parquet will be suitable.