Trying to read a Parquet file that uses the Zstd codec

Hello everyone,

I would like to read a Parquet file into a Julia DataFrame. Unfortunately for my Julia skills (but fortunately for my hard drive), the codec used to compress the Parquet file was Zstd. I have played around with CodecZstd and TranscodingStreams along with FileIO, and I can’t seem to get the file to load, much less get the data into a DataFrame. Any help would be appreciated.

Below is the only code I’ve gotten to run, but I have no idea what to do with the lines it prints.

using Glob, CodecZstd, DataFrames

p = dirname(@__FILE__)
parent_dir = splitdir(p)[1]
parquet_folder = joinpath(parent_dir, "month_parquet_RS")
println(parquet_folder)
parquet_files = glob("*.parquet", parquet_folder)
parquet_files = parquet_files[1]

proc = open(parquet_files)
stream = ZstdCompressorStream(proc)
for line in eachline(stream)
    println(line)
end
close(stream)
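
A note on why the printed lines look like noise: `ZstdCompressorStream` compresses whatever it reads (the decompressing counterpart is `ZstdDecompressorStream`), and in any case a Parquet file is not one big Zstd stream. The codec is applied to individual pages inside the file, which sits between a 4-byte `PAR1` marker at the start and another at the end. Below is a minimal sketch, using only Base, for checking those markers on an unencrypted file. The helper name `has_parquet_magic` is made up for illustration.

```julia
# Hypothetical helper: check for the "PAR1" marker that opens and closes a Parquet file.
const PARQUET_MAGIC = b"PAR1"

function has_parquet_magic(path::AbstractString)
    filesize(path) < 8 && return false       # too small to hold both markers
    open(path, "r") do io
        head = read(io, 4)                    # first four bytes of the file
        seek(io, filesize(path) - 4)          # jump to the last four bytes
        tail = read(io, 4)
        return head == PARQUET_MAGIC && tail == PARQUET_MAGIC
    end
end
```

If this returns `true` for the file, the file itself is fine and what is needed is a Parquet reader that knows the Zstd codec, not a stream wrapper around the whole file.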

Try ParquetFiles.jl

This is the result of using ParquetFiles.jl (which is itself based on Parquet.jl):

ERROR: LoadError: Unknown compression codec for column chunk: 6

Neither FileIO nor ParquetFiles supports Zstd at the moment. I think I need to use CodecZstd, but I’m not sure how.
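
For reference, the `6` in that error is the numeric id of the compression codec. In the `CompressionCodec` enum of the Parquet format's Thrift definition, 6 is ZSTD, so the reader recognizes the file and only lacks that one codec. A small lookup sketch follows; the `CODEC_NAMES` table and `codec_name` helper are illustrative and not part of Parquet.jl.

```julia
# Codec ids as listed in the Parquet format's CompressionCodec enum.
const CODEC_NAMES = Dict(
    0 => "UNCOMPRESSED",
    1 => "SNAPPY",
    2 => "GZIP",
    3 => "LZO",
    4 => "BROTLI",
    5 => "LZ4",
    6 => "ZSTD",
    7 => "LZ4_RAW",
)

# Hypothetical helper: translate the id from the error message into a name.
codec_name(id::Integer) = get(CODEC_NAMES, id, "unknown codec $id")

println(codec_name(6))
```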

Parquet support for Julia is a work in progress.

Your best bet is to use PyCall or RCall to read the Parquet file, save it as Feather, and then load it into Julia.
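
A rough sketch of the PyCall route, assuming the Python that PyCall is linked against has `pandas` and `pyarrow` installed. It swaps in Arrow.jl to load the result, on the assumption that pandas writes Feather V2 (the Arrow file format) by default; the function name and the scratch path are made up for illustration.

```julia
using PyCall      # call into Python from Julia
using Arrow       # assumption: Arrow.jl reads the Feather V2 file that pandas writes
using DataFrames

# Hypothetical helper: Parquet -> (pandas) -> Feather -> Julia DataFrame.
function parquet_via_feather(parquet_path::AbstractString,
                             feather_path::AbstractString = tempname() * ".feather")
    pd = pyimport("pandas")
    pdf = pd.read_parquet(parquet_path)   # pyarrow decodes the Zstd pages on the Python side
    # Write without compression so the Julia side has nothing extra to decode.
    pdf.to_feather(feather_path, compression = "uncompressed")
    return DataFrame(Arrow.Table(feather_path))
end

# Usage (path is a placeholder):
# df = parquet_via_feather("some_file.parquet")
```

Note that `to_feather` may complain if the pandas frame has a non-default index, in which case resetting the index on the Python side first should help.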

I made a pull request that might add support for Zstd-compressed Parquet files. I’ll post here if the pull request gets approved. I’m not sure it will, though: I have no formal computer science training and I’ve never submitted a pull request before, so I’ve probably done something incorrectly.

@ldsands

Hey, Zstd is working in Diban.jl. I am working (slowly) to try and get this back into Parquet.jl.

It was a simple addition, though. If you can’t wait, just try Diban.jl:

]activate diban-test
]dev Thrift

# add a particular version of Parquet.jl with fixes
]add https://github.com/xiaodaigh/Parquet.jl#zj/fix-reader

# add the latest version of Dìbǎn
]add https://github.com/xiaodaigh/Diban.jl

Then read your Zstd-compressed file using:

read_parquet(path_to_file)
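
If you would rather script the setup than type it into the Pkg REPL, the same steps can probably be written with the Pkg API. This is a sketch that mirrors the commands above one for one; it needs network access and has not been checked against the current state of those repositories.

```julia
using Pkg

# Work in a separate environment, same name as in the REPL commands above.
Pkg.activate("diban-test")

# Check out Thrift for development.
Pkg.develop("Thrift")

# Add the Parquet.jl branch with the reader fixes.
Pkg.add(PackageSpec(url = "https://github.com/xiaodaigh/Parquet.jl", rev = "zj/fix-reader"))

# Add the latest version of Diban.jl.
Pkg.add(PackageSpec(url = "https://github.com/xiaodaigh/Diban.jl"))
```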