Parquet2.jl 0.2.15 thrift metadata not readable by pyarrow: fixed by Thrift2 0.1.3!

Today this issue was filed. Alarmingly, it indicates that the output of Thrift2.jl, despite being read properly by Thrift.jl, Thrift2.jl and the Python package fastparquet, is not currently being read properly by pyarrow. Given that three completely separate implementations read the output, it’s quite possible that the issue lies with pyarrow. Unfortunately, the thrift spec is somewhat vague (it’s literally called “the missing spec”), and given how popular pyarrow is, I don’t think there’s any choice but to consider its implementation definitive.

Please cap Parquet2.jl to 0.2.13 for the time being.
The issue has since been fixed! Please update to the latest version.

The thrift spec explicitly states that Int8 values are special-cased: they are written bare, as a single byte, without the binary encoding scheme (varint composed with zigzag) used by all other integers. Sadly, I had missed this. I did not notice it before for three reasons: Int8 values in the metadata are rather rare; I was reading them back the same way I was writing them, so Thrift2.jl itself saw no issues; and in the offending case the value occurred in a piece of metadata that is rarely read or validated by parquet implementations (it’s essentially redundant integer type metadata).
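To make the difference concrete, here is a minimal Python sketch of the two encodings. It is not the Thrift2.jl code; the function names are made up for illustration, and it only assumes the usual zigzag-plus-varint scheme for wider integers and a single raw two's-complement byte for Int8.

```python
def zigzag(n: int, bits: int = 64) -> int:
    # Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3.
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)


def varint(u: int) -> bytes:
    # Emit 7 bits per byte, least significant group first; high bit means "more follows".
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_zigzag_varint(n: int) -> bytes:
    # Hypothetical helper: the scheme used for the wider integer types.
    return varint(zigzag(n))


def encode_i8_bare(n: int) -> bytes:
    # Hypothetical helper: an Int8 is a single raw two's-complement byte.
    return bytes([n & 0xFF])


def decode_zigzag_varint(data: bytes) -> int:
    u, shift = 0, 0
    for b in data:
        u |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            break
    return (u >> 1) ^ -(u & 1)


def decode_i8_bare(data: bytes) -> int:
    return int.from_bytes(data[:1], "big", signed=True)


# A writer that wrongly zigzag-encodes an Int8 still round-trips with a
# reader that makes the same mistake, which is why the bug stayed hidden.
wrong = encode_zigzag_varint(8)
print(decode_zigzag_varint(wrong))  # the symmetric reader gets the value back
print(decode_i8_bare(wrong))        # a spec-following reader sees a different value
```

For the value 8, the bare encoding is the byte `0x08` while the zigzag varint encoding is `0x10`, so a reader that expects a bare byte decodes 16 instead of 8.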

This has been fixed in Thrift2.jl 0.1.3, which will be available via the General registry once that PR merges. Once it does, I will patch Parquet2.jl to require Thrift2.jl version 0.1.3 or later.

This was rather scary, as I’m fully aware that Parquet2.jl writing bad files means that users can wind up with subtly corrupt files lying around without realizing there is anything wrong with them. I have therefore come around to the fact that I must implement a suite of tests using pyarrow, which I had thus far been avoiding due to dependency issues, so that’ll be coming in the not-too-distant future.

Whoa, thanks for the speed and your fastidious work!
