I would like to read a parquet file into a Julia DataFrame. Unfortunately for my Julia skills (but fortunately for my hard drive), the codec used to compress the parquet file was zstd. I have played around with CodecZstd and TranscodingStreams along with FileIO, but I can’t seem to get the file to load, much less get the data into a DataFrame. Any help would be appreciated.
Below is the only code I’ve gotten to run, but I have no idea what to do with the lines it prints.
using Glob, CodecZstd, DataFrames
p = dirname(@__FILE__)
parent_dir = splitdir(p)[1]
parquet_folder = joinpath(parent_dir, "month_parquet_RS")
println(parquet_folder)
parquet_files = glob("*.parquet", parquet_folder)
parquet_file = parquet_files[1]  # take the first matching file
proc = open(parquet_file)
# wrap the file handle in a Zstd compressor stream
stream = ZstdCompressorStream(proc)
for line in eachline(stream)
    println(line)
end
close(stream)
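For reference, here is a minimal sketch of the CodecZstd API on its own, using a made-up in-memory payload rather than a real parquet file. It shows that `ZstdCompressorStream` and `ZstdCompressor` compress their input, while the `ZstdDecompressor` counterparts are the ones that decode it. Note that a parquet file is not a single zstd stream: the format compresses individual pages inside the file, so wrapping the whole file in either stream type will not produce readable rows. That job belongs to a parquet reader with zstd support.

```julia
using CodecZstd  # extends Base.transcode for the Zstd codecs

# Hypothetical sample payload, standing in for real data.
original = Vector{UInt8}("hello parquet, hello zstd")

# Array API: compress, then decompress, entirely in memory.
compressed = transcode(ZstdCompressor, original)
restored = transcode(ZstdDecompressor, compressed)
@assert restored == original

# Streaming API: wrap an IO in ZstdDecompressorStream and read the decoded bytes.
stream = ZstdDecompressorStream(IOBuffer(compressed))
restored_from_stream = read(stream)
close(stream)
@assert restored_from_stream == original
```

Reading from a compressor stream, as in the code above, yields compressed bytes, which is why the printed lines look like noise.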
I made a pull request that might add support for zstd-compressed parquet files. I’ll post here if the pull request gets approved. I’m not sure it will, though: I have no formal computer science training and I’ve never submitted a pull request before, so I’ve probably done something incorrectly.
Hey, Zstd is working in Diban.jl. I am working (slowly) on getting this back into Parquet.jl.
It was a simple addition, though. If you can’t wait, just try Diban.jl:
]activate diban-test
]dev Thrift
# add a particular version of Parquet.jl with fixes
]add https://github.com/xiaodaigh/Parquet.jl#zj/fix-reader
# add the latest version of Dìbǎn
]add https://github.com/xiaodaigh/Diban.jl