Hey there, I am currently trying to work with dumps of Reddit data, which come as a lot of compressed JSON in the zst format (e.g. 11 GB for one month of submissions).
When I try to read lines from that data I get an unspecified zstd error, probably because the file is too large. The code also breaks if I read a given number of bytes from the stream. Additionally, I tried using mmap and processing chunks manually, but I think that breaks the codec when I call transcode.
What would be a good approach to handling decompression of very large files?
Can you give an example of what code you are running and what the error looks like? You can enclose code blocks in triple-backticks ``` to get monospace font.
Also, what version of the packages are you using? (`pkg> st --manifest TranscodingStreams CodecZstd`)
In the minimal version I am basically running the example from the documentation. I also tried it with a small file, and there it works as expected.
```julia
using CodecZstd

function read_and_decode(file_path)
    s = open(file_path)
    reader = ZstdDecompressorStream(s; bufsize=2^31)
    for line in readlines(reader)
        println(line)
    end
    close(s)
end

read_and_decode(<path to large file>)
```
The repo with the monthly files can be found by searching for academictorrents + reddit monthly submissions (I can’t post links apparently…).
There is also a repo linked with Python scripts that work on the dumps, so the file per se is not corrupted. But I noticed that I run into an issue when I use the default zstd command in bash, so I suspect something similar happens in Julia:
```
RS_2023-03.zst : Decoding error (36) : Frame requires too much memory for decoding
RS_2023-03.zst : Window size larger than maximum : 2147483648 > 134217728
```

I think this means the data was compressed with the `--long[=#]` option of the zstd CLI tool: the frame declares a 2^31-byte (2 GiB) window, while the decoder's default maximum is 2^27 bytes (128 MiB). You can use the `--long=31` flag to decode data with a 2^31-byte window size.
If you have a ton of hard drive space and a good mmap, you can also use ChunkCodecLibZstd:
```julia
julia> using Mmap

julia> using ChunkCodecLibZstd: ChunkCodecCore, ZstdCodec

julia> encoded_data = mmap(open("RS_2023-03.zst"));

julia> decoded_size = ChunkCodecCore.try_find_decoded_size(ZstdCodec(), encoded_data)
161869704726

julia> isnothing(decoded_size) && error("unable to find the decoded size")
false

julia> dst = mmap(open("decoded.jsonl", "w+"), Vector{UInt8}, decoded_size; grow=true);

julia> ChunkCodecCore.decode!(ZstdCodec(), dst, encoded_data);
```
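Once `decode!` finishes, `decoded.jsonl` holds the full ~160 GB of newline-delimited JSON on disk, and you can stream it line by line without loading it into RAM. A minimal sketch (the `count_lines` name and the per-line processing are just illustrative):

```julia
# Stream the decoded newline-delimited JSON from disk.
# `eachline` keeps only the current line in memory, so this stays
# cheap even for a file much larger than RAM.
function count_lines(path)
    open(path) do io
        n = 0
        for line in eachline(io)
            # parse/process one submission's JSON object per line here
            n += 1
        end
        n
    end
end
```

Each line of these dumps is a single JSON object, so whatever parser you use only ever sees one record at a time.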
If not, you can use the zstd CLI with `eachline` instead of `readlines` to avoid using more than about 2 GB of memory:
```julia
julia> using Zstd_jll

julia> function read_and_decode(file_path)
           open(`$(Zstd_jll.zstd()) --decompress --stdout --long=31 $(file_path)`; read=true) do s
               n = Int64(0)
               for line in eachline(s; keep=true)
                   n += ncodeunits(line)
               end
               n
           end
       end
read_and_decode (generic function with 1 method)

julia> read_and_decode("RS_2023-03.zst")
161869704726
```