Trying to read parquet file that is using Zstd codec

ldsands · December 24, 2019, 3:31pm

Hello everyone,

I would like to read a Parquet file into a Julia DataFrame. Unfortunately for my Julia skills (but fortunately for my hard drive), the codec used to compress the Parquet file was zstd. I have played around with CodecZstd and TranscodingStreams, along with FileIO, and I can’t seem to get the file to load, much less get the data into a DataFrame. Any help would be appreciated.

Below is the only code I’ve gotten to run, but I have no idea what to do with the line output it produces.

using Glob, CodecZstd, DataFrames

p = dirname(@__FILE__)
parent_dir = splitdir(p)[1]
parquet_folder = joinpath(parent_dir, "month_parquet_RS")
println(parquet_folder)
parquet_files = glob("*.parquet", parquet_folder)
parquet_files = parquet_files[1]

proc = open(parquet_files)
stream = ZstdCompressorStream(proc)
for line in eachline(stream)
    println(line)
end
close(stream)

xiaodai · December 24, 2019, 9:22pm

Try ParquetFiles.jl

ldsands · December 24, 2019, 9:32pm

This is the result of using ParquetFiles.jl (which is itself based on Parquet.jl):

ERROR: LoadError: Unknown compression codec for column chunk: 6

Neither FileIO nor ParquetFiles supports zstd at the moment. I think I need to use CodecZstd, but I’m not sure how.
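
For context, codec 6 in that error is the value the Parquet format's CompressionCodec enum assigns to ZSTD. Parquet compresses the pages inside each column chunk rather than the file as a whole, so wrapping the entire file in a Zstd stream will not yield rows. The sketch below only shows how CodecZstd round-trips a single byte buffer. The buffer contents are made up for illustration, and locating real page bytes inside a Parquet file is left out.

```julia
using CodecZstd  # provides ZstdCompressor / ZstdDecompressor for `transcode`

# Stand-in for the raw bytes of one page (hypothetical data); a real reader
# would slice these out of the file using the Parquet metadata.
original = Vector{UInt8}("example page bytes " ^ 100)
compressed = transcode(ZstdCompressor, original)

# Decompression needs ZstdDecompressor. ZstdCompressorStream, as used in the
# snippet in the first post, compresses its input instead.
restored = transcode(ZstdDecompressor, compressed)
println(restored == original)
```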

xiaodai · December 24, 2019, 9:42pm

Parquet support for Julia is WIP.

xiaodai · December 24, 2019, 9:51pm

Your best bet is to use PyCall or RCall to read the Parquet file, save it as Feather, and then load it into Julia.
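
A minimal sketch of the PyCall route, assuming a Python environment with pyarrow installed and the Julia packages PyCall and Feather. The function name `parquet_to_dataframe` and the paths are hypothetical. `version = 1` asks pyarrow (0.17 or later) for the older Feather format, on the assumption that Feather.jl cannot read the newer V2 layout.

```julia
using PyCall            # assumes Python with pyarrow is available to PyCall
using Feather, DataFrames

# Hypothetical helper: Parquet -> Feather (via pyarrow) -> DataFrame.
function parquet_to_dataframe(parquet_path; feather_path = tempname() * ".feather")
    pq = pyimport("pyarrow.parquet")
    feather = pyimport("pyarrow.feather")

    # pyarrow handles the Zstd-compressed column chunks on the Python side.
    table = pq.read_table(parquet_path)

    # Write the older Feather V1 format (assumption: needed for Feather.jl).
    feather.write_feather(table, feather_path, version = 1)

    # Load the intermediate Feather file into a Julia DataFrame.
    return Feather.read(feather_path)
end
```

It would be called as df = parquet_to_dataframe("path/to/file.parquet"), with the path replaced by a real file. Feather V1 covers fewer column types than Parquet, so files with nested columns may not survive the detour.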

ldsands · December 26, 2019, 9:33pm

I made a pull request that might add support for zstd-compressed Parquet files. I’ll post here if the pull request gets approved. I’m not sure it will, though: I have no formal computer science training and I’ve never submitted a pull request before, so I’ve probably done something incorrectly.

xiaodai · May 6, 2020, 5:45am

@ldsands

Hey, Zstd is working in Diban.jl. I am working (slowly) to try and get this back into Parquet.jl.

But it was a simple addition. If you can’t wait, just try Diban.jl:

]activate diban-test
]dev Thrift

# add a particular version of Parquet.jl with fixes
]add https://github.com/xiaodaigh/Parquet.jl#zj/fix-reader

# add the latest version of Dìbǎn
]add https://github.com/xiaodaigh/Diban.jl

Then read your ZSTD-compressed file using:

read_parquet(path_to_file)
