Parquet2 RFC: removing `CacheVector`

ExpandingMan · September 7, 2022, 5:13pm

When I wrote Parquet2.jl, I had what at the time I thought was a terribly clever idea about only loading specific subsets of the full buffer. My motivation for this was based on the observation that, while parquet is already partitioned by rows, it is very common to find parquet files with an absurd number of columns and for the vast majority of those to be useless for the task at hand. This still seems like a good idea in principle: if, hypothetically, somebody wrote a 4 GB (the maximum size) parquet file with thousands of totally useless columns but only a single row group, there are potentially huge savings to be had. That I can at least contrive a scenario where my caching scheme would lead to huge speedups is the reason I was feeling nervous enough to write this post before ripping the package apart.

In practice, I have not found it very likely that my caching scheme will be useful. The reason is that, of course, the people behind parquet already understood this problem, and while perhaps they have not solved it the way I would have preferred, it is indeed addressed by the fact that most parquet files consist of a large number of row groups of roughly 200 MB each. In such a scenario my caching scheme is probably not going to be useful, particularly in remote contexts, because it can generate a huge number of IO calls. I then have to optimize between two competing factors: the number of calls and the number of bytes per call.
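To make that trade-off concrete, here is a minimal sketch of a lazily loaded, block-cached byte vector. It is an assumption-laden illustration of the general idea, not the actual CacheVector implementation: the type name `LazyBlockBytes`, its fields, and the block sizes are all hypothetical.

```julia
# Hypothetical sketch, NOT Parquet2.jl's CacheVector: a read-only byte vector
# that fetches fixed-size blocks from an IO only when they are first indexed.
mutable struct LazyBlockBytes{I<:IO} <: AbstractVector{UInt8}
    io::I
    len::Int
    blocksize::Int
    blocks::Dict{Int,Vector{UInt8}}  # zero-based block index => loaded bytes
    nfetches::Int                    # number of IO calls made so far
end

LazyBlockBytes(io::IO, len::Integer, blocksize::Integer) =
    LazyBlockBytes(io, Int(len), Int(blocksize), Dict{Int,Vector{UInt8}}(), 0)

Base.size(v::LazyBlockBytes) = (v.len,)
Base.IndexStyle(::Type{<:LazyBlockBytes}) = IndexLinear()

function Base.getindex(v::LazyBlockBytes, i::Int)
    @boundscheck checkbounds(v, i)
    b, off = divrem(i - 1, v.blocksize)  # zero-based block index and offset
    block = get!(v.blocks, b) do
        # Block not cached yet: one seek + read, counted as one IO call.
        seek(v.io, b * v.blocksize)
        v.nfetches += 1
        read(v.io, min(v.blocksize, v.len - b * v.blocksize))
    end
    return block[off + 1]
end

# Stand-in for a file: 1_000_000 deterministic bytes.
data = UInt8.(mod.(0:999_999, 251))
small = LazyBlockBytes(IOBuffer(data), length(data), 1_000)
large = LazyBlockBytes(IOBuffer(data), length(data), 100_000)

# Read the same 50_000-byte range through both.
chunk_small = small[200_001:250_000]
chunk_large = large[200_001:250_000]

println(small.nfetches)  # 50 calls of 1_000 bytes each
println(large.nfetches)  # 1 call, but 100_000 bytes fetched for 50_000 needed
```

Small blocks fetch only what is needed but multiply the number of calls, which hurts most when each call is a remote request. Large blocks cut the number of calls at the price of pulling in bytes that are never read.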

I feel vaguely proud that I have implemented Parquet2.jl on top of CacheVector with approximately zero performance overhead (this statement is based on microbenchmarks only). However, I’ve recently started having the horrible feeling that I’m not going to have any choice but to one day implement arbitrarily nested structures (this is going to be absolutely no fun whatsoever). While I’m satisfied with CacheVector for the package as it sits today, the reality is that it significantly increases the complexity of many would-be improvements.

A list of disadvantages of CacheVector (as opposed to a Vector{UInt8}) follows:

- While what it does currently is very efficient, and, I would argue, a testament to Julia that so much abstraction results in 0 overhead, it will inevitably be extra work to keep it that way in the face of major changes.
- Its presence drastically increases the complexity of configuration arguments to Parquet2.Dataset since it has many parameters.
- It provides no benefit whatsoever when files can be memory mapped since memory mapping is already doing everything that CacheVector would do in a much better way.
- If you are not careful it can be used to produce a ton of expensive IO calls.
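For the memory-mapping point, a minimal sketch using Julia's `Mmap` standard library shows why a plain `Vector{UInt8}` is enough for local files. The helper name `read_byte_range` and the temporary file are hypothetical and only serve the illustration; this is not how Parquet2.jl opens files.

```julia
using Mmap

# Hypothetical helper, not part of Parquet2.jl: return the bytes in range `r`
# of the file at `path` via a memory map.
function read_byte_range(path::AbstractString, r::UnitRange{Int})
    buf = Mmap.mmap(path)  # Vector{UInt8} backed by the file; the OS pages data in on demand
    return buf[r]          # copies only the requested bytes into an ordinary Vector{UInt8}
end

# Write a small temporary file with bytes 0x00 through 0xff to try it on.
path, io = mktemp()
write(io, UInt8.(0:255))
close(io)

bytes = read_byte_range(path, 11:20)
println(bytes == UInt8.(10:19))  # true
```

The mapped array is already an ordinary `Vector{UInt8}` and the operating system decides which pages to load, so no extra caching layer or block-size parameter is involved.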

Anyway, I doubt that anyone has even looked into Parquet2.jl enough to realize that CacheVector exists, so this post is probably pointless, but it was cathartic to get my thoughts out there on the topic and this will provide an opportunity for someone to come along and say “for the love of god, don’t get rid of it” before I go and rip it out.
