Trying to read a Parquet file that uses the Zstd codec

Hello everyone,

I would like to read a Parquet file into a Julia DataFrame. Unfortunately for my Julia skills (but fortunately for my hard drive), the file was compressed with the zstd codec. I have played around with CodecZstd and TranscodingStreams, along with FileIO, and I can’t get the file to load, much less get the data into a DataFrame. Any help would be appreciated.

Below is the only code I’ve gotten to run, but I have no idea what to do with the lines it prints.

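using Glob, CodecZstd, DataFrames

# Locate the parquet folder, a sibling of this script's directory
p = dirname(@__FILE__)
parent_dir = splitdir(p)[1]
parquet_folder = joinpath(parent_dir, "month_parquet_RS")
println(parquet_folder)

# Take the first parquet file found in that folder
parquet_files = glob("*.parquet", parquet_folder)
parquet_file = parquet_files[1]

# Wrap the raw file in a zstd compressor stream and print whatever comes out
proc = open(parquet_file)
stream = ZstdCompressorStream(proc)
for line in eachline(stream)
    println(line)
end
close(stream)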
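
A possible explanation for the unreadable output above: a Parquet file is not itself a zstd stream. The zstd compression is applied to the column data inside the file, while the file as a whole starts with the Parquet magic bytes `PAR1`. The sketch below checks the first four bytes of a file to tell the two formats apart. `sniff_format` is a hypothetical helper name, not part of any package, and the check assumes an ordinary (unencrypted) Parquet file.

```julia
# Magic bytes that identify each format.
const PARQUET_MAGIC = b"PAR1"                        # first 4 bytes of a Parquet file
const ZSTD_MAGIC    = UInt8[0x28, 0xb5, 0x2f, 0xfd]  # first 4 bytes of a zstd frame

# Classify a file by its first four bytes.
# `sniff_format` is a hypothetical helper used only for illustration.
function sniff_format(path::AbstractString)
    header = open(io -> read(io, 4), path)  # reads at most 4 bytes
    header == PARQUET_MAGIC && return :parquet
    header == ZSTD_MAGIC    && return :zstd
    return :unknown
end
```

If `sniff_format(parquet_file)` returns `:parquet`, wrapping the whole file in a zstd stream will not yield the table data; a Parquet reader has to decode the compressed column chunks individually.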

Try ParquetFiles.jl.


This is the result of using ParquetFiles.jl (which is itself built on Parquet.jl):

ERROR: LoadError: Unknown compression codec for column chunk: 6

Neither FileIO nor ParquetFiles supports zstd at the moment. I think I need to use CodecZstd, but I’m not sure how.
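
For context, codec `6` in the error message is the number the Parquet format assigns to ZSTD, so the reader recognized the file but had no decoder for that codec. As a rough sketch of the piece CodecZstd would contribute inside a reader, the example below round-trips an in-memory byte buffer with `transcode`. The buffer is made-up stand-in data, not a real Parquet page, and finding the compressed bytes inside the file is still the reader's job.

```julia
using CodecZstd  # third-party package; provides ZstdCompressor / ZstdDecompressor

# Stand-in for the uncompressed bytes of one column page (illustrative data only).
raw = Vector{UInt8}("some column data " ^ 100)

# What a writer would store in the file for that page.
compressed = transcode(ZstdCompressor, raw)

# What a reader would do after slicing the compressed bytes out of the file.
decoded = transcode(ZstdDecompressor, compressed)

@assert decoded == raw
```

This is why a whole-file `ZstdDecompressorStream` is not enough on its own: the decompression has to be applied per compressed block, after the Parquet metadata has been parsed.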


Parquet support for Julia is a work in progress.


Your best bet is to use PyCall or RCall to read the Parquet file, save it as Feather, and then load that into Julia.
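
A minimal sketch of the PyCall route, under several assumptions: the Python environment used by PyCall has `pandas` and `pyarrow` installed, the file has a default index (which `to_feather` requires), and Arrow.jl is used on the Julia side because current pandas writes Feather V2 (the Arrow IPC file format). `parquet_to_dataframe` is a hypothetical helper name.

```julia
using PyCall, Arrow, DataFrames  # third-party packages

# Hypothetical helper: Parquet -> Feather (via Python) -> Julia DataFrame.
function parquet_to_dataframe(parquet_path::AbstractString)
    pd = pyimport("pandas")
    feather_path = tempname() * ".feather"
    # pandas hands the read to pyarrow, which can decode zstd-compressed column chunks.
    pd.read_parquet(parquet_path).to_feather(feather_path)
    # Load the intermediate Feather file on the Julia side.
    return DataFrame(Arrow.Table(feather_path))
end
```

Calling `parquet_to_dataframe(parquet_file)` would then return a regular DataFrame. The intermediate file costs extra disk space and time, so this is a workaround rather than a replacement for native zstd support.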


I made a pull request that might add support for zstd-compressed Parquet files. I’ll post here if it gets approved. I’m not sure it will, though: I have no formal computer science training and have never submitted a pull request before, so I’ve probably done something incorrectly.

@ldsands

Hey, Zstd is working in Diban.jl. I am working (slowly) to try and get this back into Parquet.jl.

It was a simple addition, though. If you can’t wait, just try Diban.jl:

]activate diban-test
]dev Thrift

# add a particular version of Parquet.jl with fixes
]add https://github.com/xiaodaigh/Parquet.jl#zj/fix-reader

# add the latest version of Dìbǎn
]add https://github.com/xiaodaigh/Diban.jl

Then read your zstd-compressed file using:

read_parquet(path_to_file)