[ANN] Parquet2.jl

Parquet2.jl

A little something I’ve been working on for the past few weeks. This package is not yet registered because I would like to try to get the community’s help with a bit more testing before registering it.

Parquet2 is fairly well tested against fastparquet and pyarrow (tests against random dataframes using all supported types), but there’s a lot of stuff out there particularly from nasty JVM programs that may do things a bit differently, and there is already good evidence that not all of these take the spec ultra seriously anyway.

So, if you are interested in this package, please give it a shot and report issues! Cases for which you can provide the original dataset will be particularly useful (and the most likely to be resolved).

Why a new package?

There is already a pure Julia parquet package, Parquet.jl. Early on, this package was extremely problematic, but it has gotten much better over time. Special thanks to all the developers who contributed to that over the years and who have shown me the way. So why did I write a brand new package instead of working on that? The answer is that I would have had to rewrite nearly all of Parquet.jl to support all of the features that I really wanted. For a partial list of those, see the features below.

Features

  • Maximal laziness. Users have full control over which row-groups (subsets of tables) and columns to load. Once columns are loaded, they are AbstractVector wrappers which read data as lazily as possible.
  • Progressive loading and caching of data. Memory mapped by default, but lots of options if you are reading from a remote source. By default, remote files are read in 100 MB blocks, meaning you don’t need to load the entire file if it is very large.
  • Full Tables.jl compatibility for both full datasets and individual row groups. A Dataset is an indexable, iterable collection of RowGroups which in turn are indexable, iterable collections of Columns.
  • The user interface is significantly simplified. No need to bother with iterators, everything is directly accessible by RowGroups, which are the units of IO according to the parquet spec.
  • Support for extending to read from data sources other than the local file system. This is especially useful because you can take advantage of the aforementioned progressive loading and caching. I already have ParquetS3.jl, a module for loading data from AWS S3 or min.io. Check out how little code it is; more such modules for other sources are very easy to add.
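To make the "progressive loading and caching" item concrete, here is a minimal sketch of a block cache in plain Julia. It is not how Parquet2.jl is implemented; `BlockCache`, `readbytes` and the `fetch` callback are hypothetical names, and `fetch` simply stands in for whatever performs a remote range read.

```julia
# Hypothetical block cache: fetches fixed-size blocks on demand and keeps them.
struct BlockCache{F}
    fetch::F                          # fetch(range) -> Vector{UInt8}; stands in for a remote read
    len::Int                          # total length of the source in bytes
    blocksize::Int                    # bytes per block
    blocks::Dict{Int,Vector{UInt8}}   # zero-based block index => bytes already fetched
end

# Default block size mirrors the 100 MB mentioned above (an assumption about units).
BlockCache(fetch, len; blocksize=100 * 1024^2) =
    BlockCache(fetch, len, blocksize, Dict{Int,Vector{UInt8}}())

# Read the bytes in `r`, fetching only the blocks that overlap it.
# Copies byte by byte for clarity, not speed.
function readbytes(c::BlockCache, r::UnitRange{Int})
    out = Vector{UInt8}(undef, length(r))
    for i in r
        b, off = divrem(i - 1, c.blocksize)   # zero-based block index and offset
        block = get!(c.blocks, b) do
            lo = b * c.blocksize + 1
            hi = min((b + 1) * c.blocksize, c.len)
            c.fetch(lo:hi)
        end
        out[i - first(r) + 1] = block[off + 1]
    end
    out
end
```

A second read that touches the same blocks never calls `fetch` again, which is the whole point when each call is an HTTP request.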

TODO

  • I didn’t do writing yet, so this is read only for now. I kept writing in mind when doing this and I have a few needed methods for it, so it should be significantly less work for me to add than it took me to do the read part in the first place, but of course there’s still a lot of work ahead.
  • Comprehensive unit testing. Don’t worry, I have good tests against both fastparquet and pyarrow output, but have yet to implement comprehensive unit tests. This is largely because I would like to finish writing first, and hopefully get a bit of feedback from users loading up parquets I haven’t been able to test so I can get an idea of how close this is to being robust.
  • Particularly if we see a lot of issues, I’d like to provide methods for users to easily provide me or other devs with the slices of datasets which are causing failures.
  • A few more here.

Why didn’t you register this?

I want to get a little feedback from users to make sure things mostly work smoothly, and take some more time to try reading JVM outputs. Like I said, I’ve already seen some weird stuff coming out of JVM world, and I expect there to be more, but I need your help to find it!

I will register this as soon as it becomes apparent that dealing with JVM stuff is mostly free of issues.

Are you willing to move this to JuliaData or JuliaIO?

Yes, definitely, if there’s a community consensus that this would be a good thing. I will avoid the arrogance of assuming this will be an “obvious” replacement for Parquet.jl since that package, while certainly less featureful, has had a hell of a lot more time to see weird outputs and work out the kinks.

More docs?

See the full documentation.

36 Likes

sorry for OT (not sure if Discourse renders Gitlab):

struct Page{ℰ,ℋ<:PageHeader}
    header::ℋ
    ℓ::Int
    compressed_ℓ::Int
    crc::Union{Nothing,Int}
    buffer::PageBuffer
end

I’m neutral on unicode. Given a similar number of keystrokes, I’d rather see concise unicode, for example Hamilt_blah vs. ℋ_blah, but maybe we’re overdoing it a bit here? ℰ requires 5 keystrokes vs. 1 for E, and there’s no clear physics/math convention here and no conflicting symbol.

Otherwise, great package. One thing we learned writing UnROOT.jl is that Julia is surprisingly good at doing low-level stuff (GC can be a bit meh but it’s fine), and it can beat complex code bases put together by humans, which sometimes miss some optimizations.

1 Like

I have largely been using the convention that type parameters are \mathcal. The usefulness of this is more obvious (to me at least) in function signatures. 5 keystrokes is irrelevant to me so it wasn’t a consideration. I do not, in general, consider unicode as being less deserving to be used as a symbol than ASCII, and frankly it’s been at least a couple of years since I’ve been that conscious of the difference in most cases.
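For anyone who hasn't seen the convention in a signature, a small sketch (the `Chunk` type and `firstor` function are made up for illustration and are not from Parquet2.jl):

```julia
# Hypothetical container; ℰ (typed \scrE<tab>) is the element type parameter.
struct Chunk{ℰ}
    values::Vector{ℰ}
end

# In a signature, script letters are type parameters and ASCII names are values,
# so the two are easy to tell apart at a glance.
firstor(c::Chunk{ℰ}, default::ℰ) where {ℰ} =
    isempty(c.values) ? default : first(c.values)

# The type parameter is directly available in the method body.
Base.eltype(::Type{Chunk{ℰ}}) where {ℰ} = ℰ
```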

3 Likes

I’d like to provide a small status update to keep interested parties informed. I’ll continue providing updates in this thread until I get to the point of registering the package.

I have gotten the chance to test my package on more diverse outputs like those produced by the more unsavory JVM programs and I now have the following top priorities:

Comprehensive set of options for how to load

As I said, I made everything as lazy as possible by default. In some cases, this is much faster, in others, much slower. Rigid settings will not be ideal for all cases and I was seeing some cases where the lazy loading was prohibitively slow. It might be possible to improve this with improvements to LazyArrays.jl, but in the meantime, certain arrays are going to have to be loaded eagerly by default. I have already fixed the worst offenders, so it’s just a matter of adding options and picking reasonable defaults. I will also add some documentation explaining the situation.
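As a rough illustration of the trade-off, here is a sketch of a lazy wrapper next to its eager equivalent, using a dictionary-encoded column as the example. The names `LazyDecoded` and `loadcolumn` and the `lazy` keyword are hypothetical and are not the package's options; indices are 1-based here for simplicity.

```julia
# Lazy: nothing is decoded until an element is requested, and every access
# pays for an extra lookup.
struct LazyDecoded{T} <: AbstractVector{T}
    dictionary::Vector{T}
    indices::Vector{Int}   # 1-based positions into `dictionary` (a simplification)
end

Base.size(v::LazyDecoded) = size(v.indices)
Base.getindex(v::LazyDecoded, i::Int) = v.dictionary[v.indices[i]]
Base.IndexStyle(::Type{<:LazyDecoded}) = IndexLinear()

# Eager: decode everything up front into an ordinary Vector.
loadcolumn(dictionary, indices; lazy::Bool=true) =
    lazy ? LazyDecoded(dictionary, indices) : dictionary[indices]
```

The lazy form wins if only a few elements are ever touched; the eager form wins if the whole column is scanned repeatedly, which is why a single rigid default cannot suit every case.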

I’m going to have to implement nested schema immediately :disappointed:

I had hoped to avoid dealing with nested schema as much as possible since it increases the complexity of reading by an order of magnitude while only being relevant in a tiny minority of cases. Unfortunately:

  • Spark seems to really enjoy outputting nested schema even when its nestedness is not apparent from inside spark.
  • Getting reasonable behavior in the presence of nested schema without breaking everything is somewhat more difficult to achieve than I had hoped.

I had explicitly named this a “low priority” but it looks like it’s going to be priority number 1.

The nested schemas themselves are not really the problem; in fact, my schema was a tree rather than an array from the start. The real problem is how deserialization of these schemas works. You’d think it would work the same way as the flat case, after which one could simply use the metadata to construct the needed structures, but unfortunately this is not the case at all.

Unfortunately solving nested schema in the general case means fully implementing the “dremel” algorithm which is described in this obnoxiously confusing paper, so it will take some time to get fully working.
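To give a flavor of why this differs from the flat case, here is a sketch of record assembly for the simplest possible nested column: one repeated field of required values, so the maximum repetition level and maximum definition level are both 1. This is only that special case, not the general algorithm, and `assemble` is a hypothetical name.

```julia
# values: the non-missing leaf values, stored flat
# rep:    repetition levels; 0 starts a new record, 1 continues the current list
# def:    definition levels; 1 means a value is present, 0 marks an empty list
# Assumes well-formed input (the first repetition level is 0).
function assemble(values::Vector{T}, rep::Vector{Int}, def::Vector{Int}) where {T}
    records = Vector{Vector{T}}()
    next = 1                                # position of the next unused value
    for (r, d) in zip(rep, def)
        r == 0 && push!(records, T[])       # repetition level 0 starts a new record
        if d == 1                           # definition level 1: a value is present
            push!(records[end], values[next])
            next += 1
        end
    end
    records
end
```

The records `[[1, 2], [], [3]]` are stored as values `[1, 2, 3]`, repetition levels `[0, 1, 0, 0]` and definition levels `[1, 1, 0, 1]`. With deeper nesting and optional fields at several depths, both levels take more values and the bookkeeping grows accordingly.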

6 Likes

Awesome stuff. Tested on a local minio instance and worked flawlessly.

Going to build some nice Julia-only data manipulation engine on top!

1 Like

Are you planning to implement support for opening many parquet files at once, like in that Dremel paper? (It looks interesting. I’m reading it to learn something new.)

As a separate note, read functionality (without writes) would enable data analysis on parquet files.

The use case of this in which all data is discoverable from a single metadata file is already implemented, and it may not even be obvious whether you are reading from multiple files or just one.

What’s still missing is some portion of schema discovery even in the absence of the main metadata file. By default, spark seems to write parquet directory trees without any metadata that covers all of it. While you are certainly free to read these in with a loop in Parquet2.jl, I don’t yet have any tools for managing a standardized directory structure, or even for concatenating many parquets in a single directory. Of course, nothing I can do will replace the functionality of real metadata, but I have to do something.
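In the meantime, the loop can be sketched with nothing but Base. `parquetfiles` and `loadtree` are hypothetical helpers, and `readone` is a placeholder for whatever reads a single file into a table (for Parquet2.jl that would presumably be something like `p -> Parquet2.Dataset(p)`, which is my assumption and not checked here).

```julia
# Collect every file ending in ".parquet" under `root`, in sorted order.
# Marker files such as _SUCCESS are skipped because they lack the extension.
function parquetfiles(root::AbstractString)
    paths = String[]
    for (dir, _, files) in walkdir(root)
        for f in files
            endswith(f, ".parquet") && push!(paths, joinpath(dir, f))
        end
    end
    sort!(paths)
end

# Apply `readone` (a placeholder reader) to each file; concatenating the
# results is left to the caller.
loadtree(readone, root) = [readone(p) for p in parquetfiles(root)]
```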

1 Like

Sounds interesting! Do you have any benchmarks between Parquet.jl, Parquet2.jl and CSV.jl?

No, it’s not quite at the point where I’d consider benchmarks particularly relevant. In some cases it’ll seem way faster than Parquet.jl because it’s doing something lazily. Ultimately I expect it to have similar performance as they work pretty similarly.

Comparisons to CSV.jl don’t make a whole lot of sense as the formats work very differently. In the best cases for parquet (basically anything without nulls) it ought to be much faster than CSV.jl. In some of the worst cases for parquet format (about 1/2 nulls or elaborate nested schema), it may be slower.

Anyway, of course at some point I’m going to have to do serious benchmarks relative to fastparquet and pyarrow and I’m sure when that happens I’ll have to fix a few issues. I likely won’t do this until (at least) reading is done.

Again, if anyone is looking for the fastest format, you should just use arrow. Parquet will never be as fast as arrow; the format just doesn’t care that much about being fast.

I find that the major difference with Unicode is that I don’t know how to type a lot of the characters. A few are fine (like \pi), but I didn’t know that your symbol names could be typed using \mathcal and wouldn’t have been able to easily contribute to the code.

The Julia REPL can be a big help here:

help?> ℋ
"ℋ" can be typed by \scrH<tab>
5 Likes

I’ve been thinking about this for a couple of months and now I finally have the time to dive in. First of all, THANK YOU for taking this on. Data folks often deal with parquet, especially since Spark is so popular. Conquering this file format will help Julia adoption in so many unexpected ways.

I see a stream of commits since the announcement, even today. This data engineer appreciates the effort.

So - I’m diving in now and testing it out, I’ll post updates here.

2 Likes

Writing was a lot more work than I had hoped (I was able to read my own parquets so I had to debug from within the other implementations to do everything), but I am finally nearing the end. The only major thing I need to do to finish writing is dictionary encoded columns, then I just have some cleanup and testing to do and I’m done.

I plan to tag and register the package as soon as the write part is complete. I also don’t want to wait forever to tag 1.0, and I don’t intend on adding additional features to do so, so once this is registered and I have some confidence that it is usable for everyone I’ll tag a 1.0.

10 Likes

I’ll second @merlin’s comments above - thank you @ExpandingMan for taking on this project. I’m looking forward to using Parquet2.jl.

One comment/question - it doesn’t appear that your implementation has hooks to easily support custom types. Is this a feature you would be willing to add? I’ve found the Arrow.jl implementation to be very well done. I especially appreciate the extensibility to custom types via metadata. If you could find a way to integrate similar functionality into Parquet2.jl I think it would be very useful. Thanks!

1 Like

This would require, at a minimum, implementing arbitrarily deeply nested schema. The parquet standard does include this in the form of dremel encoding, but implementing this is quite involved. I decided early on not to take this on, and I don’t currently have any plans of adding it. (It’s worth pointing out, I just recently had to re-implement page loading which I will discuss below, and this brings me a little bit closer to dremel, but not much.) Obviously if someone else wanted to contribute it I’d be thrilled, but it would be quite an extensive PR.

If at any point Parquet2.jl has support for arbitrarily nested schema, I will explore whether ArrowTypes.jl is general enough to use here, but there will never be a ParquetTypes.jl implemented by me (though, ideally, the nested schema implementation would be general enough that it would be relatively easy for someone else to do).

Alternatively, Parquet2.jl already supports JSON and BSON. Currently those automatically get parsed into dicts, but a good minor feature that I could add at some point would be more arbitrary user-defined handling of these.

In the meantime, please see my comments on choosing a format. In my mind the only reason to ever use parquet instead of arrow is for compatibility (for which it is desperately needed, thus my spending time on it) or in rare cases in which you have very large quantities of data with a large number of null values.

Development Update

I’ll use this opportunity to update those interested on my current progress, since I am so close to a release.

I am very grateful to Anand Bisen for opening this issue and including his dataset. The issue itself is minor (he was using a function which caused the columns to constantly reload, I will address this by updating documentation to better explain what Parquet2.Dataset is and what to do in these cases), but I was able to uncover other issues by reading the dataset he provided.

The table (converted to parquet with pyarrow) reads correctly but it introduced to me a case I wasn’t aware of: that of columns containing a large number (100’s or 1000’s) of pages. Previously I had assumed that in almost all cases columns would consist of just a few pages.

This has required me to support a different method for allocating data for newly loaded columns. This method is not in the slightest bit “lazy”. It seems that in practice, the lazy loading which was such a priority for me is a pretty rare use case for parquet (this is the principal reason I prefer arrow).
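A sketch of what an eager, many-page allocation can look like, assuming the pages have already been decoded into ordinary vectors (`loadpages` is a hypothetical name and this is not the package's code):

```julia
# Copy a column made of many decoded pages into one preallocated vector,
# so there is a single allocation no matter how many pages there are.
function loadpages(pages::Vector{Vector{T}}) where {T}
    out = Vector{T}(undef, sum(length, pages; init=0))
    pos = 1                                   # next free position in `out`
    for page in pages
        copyto!(out, pos, page, 1, length(page))
        pos += length(page)
    end
    out
end
```

With hundreds or thousands of pages, paying for one allocation and a linear copy is much more predictable than wrapping each page lazily.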

I am currently struggling with some performance issues which I need to resolve before release (in this case about a factor of 2 slower than Parquet.jl). It should be a solvable problem since my micro-benchmarks of performance-critical code are looking quite good, but I haven’t managed to track down the culprit yet.

I may wind up largely abandoning lazily loading columns as views because the cases in which this is possible seem so rare in practice (plain encoded numerical columns with no nulls).
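For reference, the case where a view is possible looks roughly like this: plain-encoded fixed-width values with no nulls are just the raw bytes, so they can be reinterpreted in place. `plainview` is a hypothetical helper, and the sketch assumes a little-endian host (parquet's plain encoding is little-endian).

```julia
# View `n` values of type `T` starting `offset` bytes into `buffer`, without copying.
# Assumes plain encoding, no nulls, and a little-endian host.
plainview(::Type{T}, buffer::Vector{UInt8}, offset::Int, n::Int) where {T} =
    reinterpret(T, view(buffer, offset + 1:offset + n * sizeof(T)))
```

Anything else (dictionary encoding, nulls, compression that has to be undone into a fresh buffer anyway) needs an actual decode step, which is why the view case is so narrow.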

6 Likes

Parquet2.jl is now registered!

Please try it out. I plan on tagging a 1.0 once I am confident there are no major issues looming.

15 Likes

I don’t know what you did but this saved my day. I spent half a day with Parquet without success, and I was able to slice data the way I wanted to in 5 minutes using Parquet2 (from learning the docs to getting things done). Thanks for developing this!

4 Likes

Implemented parquet completely from scratch… so a lot more work than I would have liked :laughing:

Anyway, I’m thrilled people are finding this useful, things seem to be going quite well so far. I’ve had a bunch of people file easily reproducible issues so I’ve been able to discover and fix a number of edge cases. When I started this project I had very similar experiences to @youngjae.woo , in theory it seemed like Parquet.jl was mostly working but every time I’d try to use it I’d wind up at the bottom of a rabbit hole, so we really needed this fixed for Julia data stuff.

I think the biggest outstanding issues for tagging a 1.0 all have to do with multi-file datasets. In particular, if you try to infer the schema of a very large parquet file tree on S3 you will generate an excessive number of HTTP calls. Meanwhile, things over in AWSS3.jl and FilePathsBase.jl have been a mess for a while and are only getting worse. This is not due to any lack of effort in those packages but rather to the incredibly annoying fact that S3 is not a file system at all, so a parquet file “tree” on S3 is technically not bijective with the analogous (actual) tree on a local file system. That leads to an incredible number of infuriating caveats, “gotchas” and highly dubious hacks.

I am currently engaged in the following:

  • Overhaul AbstractTrees.jl based on the unfinished “cursor” design from a recent PR by Keno.
  • Extend FilePathsBase.jl to implement traits allowing for better default behavior for key-value stores and using AbstractTrees.jl (I have discussed traits with the maintainers but not using AbstractTrees.jl, so it still remains to be seen if I’ll be able to do this).
  • Rewrite AWSS3.jl to use the new extended interface from FilePathsBase.jl

At this point I will (by design) be able to infer the entire directory tree easily in a single HTTP call and Parquet2.jl should be much nicer for dealing with very large file trees on remote sources.
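The idea of inferring the tree from one listing can be sketched as follows: a key-value store returns flat keys, and the "directories" are reconstructed from the `/` separators. `keytree` is a hypothetical function, not part of AWSS3.jl or FilePathsBase.jl, and it assumes that no key is also a prefix of another key, which is exactly the kind of thing S3 does not guarantee.

```julia
# Build a nested Dict from flat "a/b/c" keys, as a single listing might return them.
# Leaves (objects) map to `nothing`; inner nodes are Dicts.
# Assumes no key is simultaneously an object and a prefix of another object.
function keytree(objectkeys)
    root = Dict{String,Any}()
    for key in objectkeys
        parts = split(key, '/'; keepempty=false)
        isempty(parts) && continue
        node = root
        for part in parts[1:end-1]
            # Create the intermediate "directory" on first sight.
            node = get!(() -> Dict{String,Any}(), node, String(part))
        end
        node[String(parts[end])] = nothing
    end
    root
end
```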

16 Likes

Amazing work; I recently moved to Julia for my data analysis workflow. I was really worried that the Parquet package would take a long time to load a dataset that has 100s of columns, I mean more than 12 hours to load a 3 GB parquet directory.
Parquet2 worked quite nicely with my dataset.
Thanks for the package

3 Likes

You are a genius, man. I remember looking at the parquet filesystem spec and being completely lost. Funny story, I failed a senior level DE interview mostly because I couldn’t describe the structure of parquet files and what’s in them.

New job, another place heavy on Spark, and lot of parquet data. Thank you.