Parquet2.jl 0.2.25 outputting corrupted files (0.2.26 fix now up)

The Parquet thrift metadata schema changed recently, and I included that change in Parquet2.jl as of 0.2.25. Unfortunately it looks like this version is now outputting corrupt files in many circumstances (I can confirm that Arrow can’t read them). Since Parquet2.jl can read its own output fine, this suggests that something is wrong with my implementation of the schema, but so far I have been unable to find the discrepancy.

Please avoid 0.2.25. If I am unable to fix this I will have to roll it back. I will get to it as soon as I am able. One way or another I will make another tag by end of day.

Thanks for the warning. I use your package heavily. It turns out I was on 0.2.24, so I guess this change must have been very recent, as I recently ran ] up.

Can you make a quick 0.2.26 release that basically rolls back to 0.2.24? A bug fix shouldn’t yank things from the registry, but at the same time this seems a bit bad.

I reverted the metadata as an emergency measure and I will tag 0.2.26 as soon as I am able. I also added an additional warning in the unit tests, because it appears I did not run the compat tests when I merged this, which was an egregious oversight.

I still have absolutely no clue what’s happening here. Adding fields to a thrift schema is not supposed to be a breaking change, and I had certainly tested that at some point. The reason for this update in the first place was that I was failing to read the updated metadata, and now it looks like the metadata is corrupt when I output it. It’s pretty puzzling, because I do not see how reading and writing could work correctly in the older version while the new version breaks even when all the new fields are null.

After this patch, Parquet2.jl will again be unable to read files with the latest metadata format. Ultimately I can just do whatever it takes to make sure that it can read all the metadata formats, but the prospects for figuring out why using the new format produced corrupted thrift are pretty grim, so I’m not very optimistic about being able to update the metadata format.

OK, 0.2.26 is tagged in the General registry. Apologies for that debacle.

I have tested thoroughly on my end but if someone wants to sanity check 0.2.26 and confirm it’s working for them I would appreciate it.

Update: I’ve now had a chance to look into this. I still don’t know why I was outputting corrupt thrift, but I definitely see why reading was failing. It seems I took some rather audacious shortcuts for the sake of type stability. Basically, I use the thrift schema to make everything type stable, but if a struct contains extra fields that aren’t in my schema, I don’t read through them in search of the stop indicator, because doing so would require duplicating all my functions with non-reading versions (otherwise those functions would be type unstable). So that at least solves one mystery, though unfortunately it will be a fair amount of work to fix.
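For readers unfamiliar with the stop indicator, here is a minimal sketch of how a thrift compact-protocol reader can tolerate fields it does not know about: it consumes each unknown value by type and keeps going until it hits the stop byte. This is written in Python purely for illustration and is not how Parquet2.jl is implemented; the names read_struct, skip_value and the known mapping are hypothetical, and only integer and binary fields are decoded.

```python
import io

# Type ids as defined by the Thrift compact protocol.
STOP, TRUE, FALSE, I8, I16, I32, I64, DOUBLE, BINARY, LIST, SET, MAP, STRUCT = range(13)


def read_varint(buf):
    """Read an unsigned LEB128 varint."""
    result = shift = 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def read_zigzag(buf):
    """Read a zigzag-encoded signed integer."""
    n = read_varint(buf)
    return (n >> 1) ^ -(n & 1)


def skip_value(buf, ttype, in_container=False):
    """Consume one value of the given type without interpreting it."""
    if ttype in (TRUE, FALSE):
        if in_container:
            buf.read(1)  # as a struct field, a boolean lives in the field header
    elif ttype == I8:
        buf.read(1)
    elif ttype in (I16, I32, I64):
        read_varint(buf)
    elif ttype == DOUBLE:
        buf.read(8)
    elif ttype == BINARY:
        buf.read(read_varint(buf))
    elif ttype in (LIST, SET):
        header = buf.read(1)[0]
        size = header >> 4
        if size == 15:  # sizes of 15 or more follow as a varint
            size = read_varint(buf)
        for _ in range(size):
            skip_value(buf, header & 0x0F, in_container=True)
    elif ttype == MAP:
        size = read_varint(buf)
        if size:  # an empty map is just the zero size byte
            kv = buf.read(1)[0]
            for _ in range(size):
                skip_value(buf, kv >> 4, in_container=True)
                skip_value(buf, kv & 0x0F, in_container=True)
    elif ttype == STRUCT:
        read_struct(buf, known={})  # nested struct: skip every field up to its stop byte
    else:
        raise ValueError(f"unknown thrift type {ttype}")


def read_struct(buf, known):
    """Read fields until the stop byte; keep ids listed in `known`, skip the rest.

    `known` is a hypothetical mapping of field id -> name.
    """
    out, last_id = {}, 0
    while True:
        header = buf.read(1)[0]
        if header == STOP:
            return out
        delta, ttype = header >> 4, header & 0x0F
        # A zero delta means the full field id follows as a zigzag varint.
        field_id = last_id + delta if delta else read_zigzag(buf)
        last_id = field_id
        if field_id in known and ttype in (I16, I32, I64):
            out[known[field_id]] = read_zigzag(buf)
        elif field_id in known and ttype == BINARY:
            out[known[field_id]] = buf.read(read_varint(buf))
        else:
            skip_value(buf, ttype)


# Fields 1, 2 and 5 are "known"; fields 3 (a struct) and 4 (a list) are extras.
data = bytes([0x15, 0x0A, 0x18, 0x02]) + b"ab" + bytes(
    [0x1C, 0x16, 0x01, 0x00, 0x19, 0x25, 0x02, 0x04, 0x15, 0x0E, 0x00, 0xFF])
buf = io.BytesIO(data)
fields = read_struct(buf, known={1: "version", 2: "name", 5: "rows"})
```

The point of the sketch is that skipping is driven by the type nibble in each field header rather than by the schema, so the reader still lands on the stop byte (and on whatever follows the struct) when a newer writer adds fields. The cost is the one described above: the skipping path cannot rely on statically known field types.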

I was never able to figure out why other readers fail to read the thrift I write in the newer metadata format: my own thrift package reads it fine, just as it reads thrift output by other writers. However, I made some major changes to the thrift reader so that it is much more robust. I have tested this, and reading newer versions of the metadata should no longer break it.

I have tested extensively, and double-checked that pyarrow and fastparquet can read its output fine, but any sanity checks on 0.2.27 would be appreciated.