[ANN] Parquet2.jl

Parquet2.jl

A little something I’ve been working on for the past few weeks. This package is not yet registered because I would like to try to get the community’s help with a bit more testing before registering it.

Parquet2 is fairly well tested against fastparquet and pyarrow (tests against random dataframes using all supported types), but there’s a lot of stuff out there, particularly from nasty JVM programs, that may do things a bit differently, and there is already good evidence that not all of them take the spec all that seriously anyway.

So, if you are interested in this package, please give it a shot and report issues! Cases for which you can provide the original dataset will be particularly useful (and the most likely to be resolved).

Why a new package?

There is already a pure Julia parquet package, Parquet.jl. Early on, that package was extremely problematic, but it has gotten much better over time. Special thanks to all the developers who have contributed to it over the years and showed me the way. So why did I write a brand new package instead of working on that? The answer is that I would have had to rewrite nearly all of Parquet.jl to support all of the features I really wanted. For a partial list of those, see below.

Features

  • Maximal laziness. Users have full control over which row-groups (subsets of tables) and columns to load. Once columns are loaded, they are AbstractVector wrappers which read data as lazily as possible.
  • Progressive loading and caching of data. Memory mapped by default, but lots of options if you are reading from a remote source. By default, remote files are read in 100 MB blocks, meaning you don’t need to load an entire very large file.
  • Full Tables.jl compatibility for both full datasets and individual row groups. A Dataset is an indexable, iterable collection of RowGroups, which in turn are indexable, iterable collections of Columns (see the short usage sketch after this list).
  • The user interface is significantly simplified. No need to bother with iterators; everything is directly accessible by RowGroup, which is the unit of IO according to the parquet spec.
  • Support for extending to read from data sources other than the local file system. This is especially useful because you can take advantage of the aforementioned progressive loading and caching. I already have ParquetS3.jl, a module for loading data from AWS S3 or min.io. Check out how little code it is; more such modules for other sources are very easy to add.
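
To give a feel for the interface, here is a minimal usage sketch based on the feature list above. The constructor name (Parquet2.Dataset) is my assumption; check the docs for the actual entry point.

using Parquet2, DataFrames

# lazily open the file (memory mapped by default)
ds = Parquet2.Dataset("data.parquet")

# a Dataset is an indexable, iterable collection of RowGroups,
# and each RowGroup is itself a Tables.jl-compatible table
for rg in ds
    df = DataFrame(rg)
end

# or materialize the whole dataset at once via the Tables.jl interface
df_all = DataFrame(ds)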

TODO

  • I haven’t done writing yet, so this is read-only for now. I kept writing in mind while designing this and already have a few of the methods needed for it, so adding it should take significantly less work than the read part did, but of course there’s still a lot of work ahead.
  • Comprehensive unit testing. Don’t worry, I have good tests against both fastparquet and pyarrow output, but I have yet to implement comprehensive unit tests. This is largely because I would like to finish writing first, and hopefully get a bit of feedback from users loading parquets I haven’t been able to test, so I can get an idea of how close this is to being robust.
  • Particularly if we see a lot of issues, I’d like to add tools that let users easily send me or other devs the slices of datasets which are causing failures.
  • A few more here.

Why didn’t you register this?

I want to get a little feedback from users to make sure things mostly work smoothly, and take some more time to try reading JVM outputs. Like I said, I’ve already seen some weird stuff coming out of JVM world, and I expect there to be more, but I need your help to find it!

I will register this as soon as it becomes apparent that dealing with JVM stuff is mostly free of issues.

Are you willing to move this to JuliaData or JuliaIO?

Yes, definitely, if there’s a community consensus that this would be a good thing. I will avoid the arrogance of assuming this will be an “obvious” replacement for Parquet.jl since that package, while certainly less featureful, has had a hell of a lot more time to see weird outputs and work out the kinks.

More docs?

See the full documentation.

Sorry for the off-topic comment (not sure if Discourse renders GitLab):

struct Page{ℰ,ℋ<:PageHeader}
    header::ℋ
    ℓ::Int
    compressed_ℓ::Int
    crc::Union{Nothing,Int}
    buffer::PageBuffer
end

I’m neutral on unicode; given a similar number of keystrokes I’d rather see concise unicode, for example Hamilt_blah vs. ℋ_blah, but maybe we’re overdoing it a bit here? ℰ requires 5 keystrokes vs. 1 for E, and there’s no clear physics/math convention here and no conflicting symbol.

Otherwise, great package. One thing we learned writing UnROOT.jl is that Julia is surprisingly good at doing low-level stuff (GC can be a bit meh, but it’s fine), and it can sometimes beat complex code bases put together by humans that miss some optimizations.

I have largely been using the convention that type parameters are \mathcal characters. The usefulness of this is more obvious (to me at least) in function signatures. Five keystrokes is irrelevant to me, so it wasn’t a consideration. I do not, in general, consider unicode any less deserving of being used for a symbol than ASCII, and frankly it’s been at least a couple of years since I’ve been that conscious of the difference in most cases.
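
A contrived illustration of that convention (not code from Parquet2.jl; the names here are made up):

struct Column{ℰ,𝒯}
    # ℰ and 𝒯 are typed as \scrE<tab> and \scrT<tab> in the REPL; the script
    # letters make the type parameters stand out from concrete types like Vector
    encoding::ℰ
    values::Vector{𝒯}
end

# in a signature, the script letters read immediately as "type parameters"
Base.eltype(::Column{ℰ,𝒯}) where {ℰ,𝒯} = 𝒯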

I’d like to provide a small status update to keep interested parties informed. I’ll continue providing updates in this thread until I get to the point of registering the package.

I have gotten the chance to test my package on more diverse outputs, like those produced by the more unsavory JVM programs, and I now have the following top priorities:

Comprehensive set of options for how to load

As I said, I made everything as lazy as possible by default. In some cases this is much faster, in others much slower. Rigid settings will not be ideal for all cases, and I was seeing some cases where the lazy loading was prohibitively slow. It might be possible to improve this with improvements to LazyArrays.jl, but in the meantime, certain arrays are going to have to be loaded eagerly by default. I have already fixed the worst offenders, so it’s just a matter of adding options and picking reasonable defaults. I will also add some documentation explaining the situation.
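
To make the lazy/eager tradeoff concrete, a quick sketch; the constructor and the name-indexing are assumptions based on the description above, not a guaranteed API:

# sketch only: constructor and string indexing are assumptions
ds = Parquet2.Dataset("data.parquet")
rg = first(ds)                 # RowGroups are iterable

lazy_col  = rg["x"]            # lazy AbstractVector wrapper: decodes as it is accessed
eager_col = collect(rg["x"])   # plain Vector: pays the full decode cost once, up front

sum(eager_col)                 # repeated passes are cheap on the eager copy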

I’m going to have to implement nested schema immediately :disappointed:

I had hoped to avoid dealing with nested schema as much as possible since it increases the complexity of reading by an order of magnitude while only being relevant in a tiny minority of cases. Unfortunately

  • Spark seems to really enjoy outputting nested schema even when the nestedness is not apparent from inside Spark.
  • Getting reasonable behavior in the presence of nested schema without breaking everything is somewhat more difficult to achieve than I had hoped.

I had explicitly named this a “low priority” but it looks like it’s going to be priority number 1.

The nested schemas themselves are not really the problem; in fact, my schema was a tree rather than a flat array from the start. The real problem is how deserialization of these schemas works. You’d think it would work the same way as the flat case, after which one could simply use the metadata to construct the needed structures, but unfortunately this is not the case at all.

Unfortunately, solving nested schema in the general case means fully implementing the “dremel” algorithm, which is described in this obnoxiously confusing paper, so it will take some time to get fully working.
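
For the curious, here is a toy illustration (my own sketch, not Parquet2.jl internals) of what dremel record assembly amounts to in the simplest possible case, a single top-level repeated integer column:

# encoded rows: [[1, 2], [], [3]]
repetition = [0, 1, 0, 0]   # 0 = a new record starts here, 1 = continue the current list
definition = [1, 1, 0, 1]   # max definition level is 1; 0 means the list is empty
values     = [1, 2, 3]      # only entries with definition == max carry a value

function assemble(repetition, definition, values; maxdef = 1)
    rows = Vector{Vector{Int}}()
    vidx = 0
    for (r, d) in zip(repetition, definition)
        r == 0 && push!(rows, Int[])   # repetition level 0 opens a new record
        if d == maxdef                 # a concrete value is present at this slot
            vidx += 1
            push!(rows[end], values[vidx])
        end
    end
    return rows
end

assemble(repetition, definition, values)   # -> [[1, 2], [], [3]]

The general case has to track arbitrary nesting depth and optional levels, which is where most of the complexity described in the paper comes from.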

Awesome stuff. Tested on a local minio instance and worked flawlessly.

Going to build some nice Julia-only data manipulation engine on top!

Are you planning to implement support for opening many parquet files at once, like in that Dremel paper? (It looks interesting. I’m reading it to learn something new.)

As a separate note, read functionality alone (without writes) already enables data analysis on parquet files.

The use case in which all data is discoverable from a single metadata file is already implemented, and it may not even be obvious whether you are reading from multiple files or just one.

What’s still missing is some portion of schema discovery in the absence of the main metadata file. By default, Spark seems to write parquet directory trees without any metadata that covers the whole tree. While you are certainly free to read these in with a loop in Parquet2.jl (see the sketch below), I don’t yet have any tools for managing a standardized directory structure, or even for concatenating many parquets in a single directory. Of course, nothing I can do will replace the functionality of real metadata, but I have to do something.
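
The loop version is short enough to do by hand; a rough sketch (the constructor name and the vcat-of-DataFrames approach are my assumptions, not a built-in feature):

using Parquet2, DataFrames

# my own sketch; not a built-in Parquet2.jl feature
dir = "some_table.parquet"   # Spark-style directory of part-*.parquet files
files = filter(f -> endswith(f, ".parquet"), readdir(dir; join = true))

df = reduce(vcat, (DataFrame(Parquet2.Dataset(f)) for f in files))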

Sounds interesting! Do you have any benchmarks between Parquet.jl, Parquet2.jl and CSV.jl?

No, it’s not quite at the point where I’d consider benchmarks particularly relevant. In some cases it’ll seem way faster than Parquet.jl because it’s doing something lazily. Ultimately I expect it to have similar performance, as they work pretty similarly.

Comparisons to CSV.jl don’t make a whole lot of sense, as the formats work very differently. In the best cases for parquet (basically anything without nulls) it ought to be much faster than CSV.jl. In some of the worst cases for the parquet format (around half nulls or elaborate nested schemas), it may be slower.

Anyway, of course at some point I’m going to have to do serious benchmarks relative to fastparquet and pyarrow and I’m sure when that happens I’ll have to fix a few issues. I likely won’t do this until (at least) reading is done.

Again, if anyone is looking for the fastest format, you should just use arrow; parquet will never be as fast as arrow, as the format just doesn’t care that much about being fast.