A little something I’ve been working on for the past few weeks. This package is not yet registered because I would like to try to get the community’s help with a bit more testing before registering it.
Parquet2 is fairly well tested against fastparquet and pyarrow (tests against random dataframes using all supported types), but there's a lot of stuff out there, particularly from nasty JVM programs, that may do things a bit differently, and there is already good evidence that not all of these take the spec ultra seriously anyway.
So, if you are interested in this package, please give it a shot and report issues! Cases for which you can provide the original dataset will be particularly useful (and the most likely to be resolved).
There is already a pure Julia parquet package, Parquet.jl. Early on, that package was extremely problematic, but it has gotten much better over time. Special thanks to all the developers who contributed to it over the years and showed me the way. So why did I write a brand new package instead of working on that one? The answer is that I would have had to rewrite nearly all of Parquet.jl to support the features I really wanted. A partial list of those:
- Maximal laziness. Users have full control over which row-groups (subsets of tables) and columns to load. Once columns are loaded, they are `AbstractVector` wrappers which read data as lazily as possible.
- Progressive loading and caching of data. Memory mapped by default, but lots of options if you are reading from a remote source. By default, remote files are read in 100 MB blocks, meaning you don't need to load the entire file if it is very large.
- Full Tables.jl compatibility for both full datasets and individual row groups. A `Dataset` is an indexable, iterable collection of `RowGroup`s, which in turn are indexable, iterable collections of columns.
- The user interface is significantly simplified. No need to bother with iterators; everything is directly accessible via the `RowGroup`s, which are the units of IO according to the parquet spec.
- Support for extending to read from data sources other than the local file system. This is especially useful because you can take advantage of the aforementioned progressive loading and caching. I already have ParquetS3.jl, a module for loading data from AWS S3 or MinIO. Check out how little code it is; more such modules for other sources are very easy to add.
- I didn't do writing yet, so this is read-only for now. I kept writing in mind while designing this and I already have a few of the methods it will need, so it should be significantly less work to add than the read support took me in the first place, but of course there's still a lot of work ahead.
- Comprehensive unit testing is still to come. Don't worry, I have good tests against both fastparquet and pyarrow output, but I have yet to implement comprehensive unit tests. This is largely because I would like to finish writing support first, and hopefully get a bit of feedback from users loading parquets I haven't been able to test, so I can get an idea of how close this is to being robust.
- Particularly if we see a lot of issues, I’d like to provide methods for users to easily provide me or other devs with the slices of datasets which are causing failures.
- A few more here.
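To make the laziness and indexing story above concrete, here is a rough sketch of what access could look like. The exact constructor and function names are my assumptions based on the description in this post (an indexable `Dataset` of `RowGroup`s, each an indexable collection of lazy `AbstractVector` columns), not a confirmed API, and `"example.parquet"` is a placeholder file:

```julia
using Parquet2  # hypothetical usage sketch; names are assumptions

# Opening a dataset reads metadata only; no column data is loaded yet
ds = Parquet2.Dataset("example.parquet")

rg = ds[1]    # a Dataset indexes into its RowGroups (the units of IO)...
col = rg[1]   # ...and a RowGroup indexes into its columns, which are
              # AbstractVector wrappers that read their data as lazily
              # as possible

# Because both satisfy the Tables.jl interface, any Tables.jl sink
# should work on a Dataset or an individual RowGroup, e.g.
# df = DataFrame(rg)
```

The appeal of this shape is that nothing is read until you index into it, so you pay IO costs only for the row groups and columns you actually touch.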
I want to get a little feedback from users to make sure things mostly work smoothly, and take some more time to try reading JVM outputs. Like I said, I’ve already seen some weird stuff coming out of JVM world, and I expect there to be more, but I need your help to find it!
I will register this as soon as it becomes apparent that dealing with JVM output is mostly free of issues.
Yes, definitely, if there’s a community consensus that this would be a good thing. I will avoid the arrogance of assuming this will be an “obvious” replacement for Parquet.jl since that package, while certainly less featureful, has had a hell of a lot more time to see weird outputs and work out the kinks.