Trying to write a parquet writer. Please help!

I am trying to understand enough about parquet to get a parquet writer going.

I am reading a Parquet file with Thrift.jl. The file indicates there is a DictionaryPageHeader at offset 4, so I read it with Thrift.jl, but the header says the column has 100k unique values when in reality it has 300k.

There is only one row group, so the missing unique values can't be sitting in other row groups.

Does anyone know much about how DictionaryPageHeaders and dictionaries work in Parquet?

Either Thrift.jl is reading the data wrong, or there are multiple dictionaries. The funny thing is, I can actually recover the 100k unique values from the Parquet file, which suggests the dictionary page itself is being read correctly. But the next page is a data page, so there are no more dictionaries.
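
One possible explanation, not confirmed for this particular file: the Parquet format lets a writer fall back to plain encoding once the dictionary grows too large, so a dictionary page can hold fewer values than the column actually contains. The toy Python model below only illustrates that idea. The function `encode_column` and its parameters `max_dict_size` and `page_size` are made up for this sketch and are not part of Parquet.jl, Thrift.jl, or any real writer.

```python
def encode_column(values, max_dict_size, page_size):
    """Toy model of dictionary encoding with a plain-encoding fallback.

    Returns (dictionary, pages). Each page is a tuple:
    ("dict", [indices into dictionary]) or ("plain", [raw values]).
    """
    dictionary = []      # unique values in first-seen order
    index_of = {}        # value -> position in dictionary
    pages = []
    fallen_back = False  # once True, all later pages are plain

    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]

        if not fallen_back:
            # Collect the values on this page that are not in the dictionary yet.
            new_values = []
            seen = set()
            for v in page:
                if v not in index_of and v not in seen:
                    seen.add(v)
                    new_values.append(v)

            if len(dictionary) + len(new_values) > max_dict_size:
                # Dictionary would grow past the (hypothetical) cap: stop using it.
                fallen_back = True
            else:
                for v in new_values:
                    index_of[v] = len(dictionary)
                    dictionary.append(v)

        if fallen_back:
            pages.append(("plain", list(page)))
        else:
            pages.append(("dict", [index_of[v] for v in page]))

    return dictionary, pages


# 300 unique values, but the dictionary is capped at 100 entries.
values = list(range(300))
dictionary, pages = encode_column(values, max_dict_size=100, page_size=50)

print(len(dictionary))              # 100
print([kind for kind, _ in pages])  # ['dict', 'dict', 'plain', 'plain', 'plain', 'plain']
```

If the file was written this way, the encoding recorded in each data page header would show whether a page uses the dictionary or stores raw values, which might account for the 200k values that are not in the dictionary.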

If any of you know about dictionaries and DictionaryPageHeaders, please help.

There is finally a working Parquet writer! See

I will start work on a PR to Parquet.jl but if you can’t wait, please help test out Diban.jl