Decompression of large files

Hey there, I am currently trying to work with dumps of Reddit data, which is a lot of compressed JSON data in the zst format (e.g. 11 GB for one month of submissions).

When I try to read lines from that data I get an unspecified zst error, probably because the file is too large. The code also breaks if I read a given number of bytes from the stream. Additionally, I tried using mmap and processing chunks manually, but I think that breaks the codec when I call transcode.

What would be a good approach to decompressing very large files?

Thank you!

Welcome to the Julia discourse.

Can you give an example of what code you are running and what the error looks like? You can enclose code blocks in triple backticks (```) to get a monospace font.
Also, what versions of the packages are you using? (pkg> st --manifest TranscodingStreams CodecZstd)

Thank you for the quick reply!

In the minimal version I am basically running the example from the documentation. I also tried it with a small file, and there it works as expected.

using CodecZstd

function read_and_decode(file_path)
    s = open(file_path)
    reader = ZstdDecompressorStream(s; bufsize=2^31)
    for line in readlines(reader)
        println(line)
    end
    close(s)
end

read_and_decode(<path to large file>)

And this is the error this throws:

ERROR: LoadError: zstd error
Stacktrace:
  [1] changemode!(stream::TranscodingStreams.TranscodingStream{ZstdDecompressor, IOStream}, newmode::Symbol)
    @ TranscodingStreams ~/.julia/packages/TranscodingStreams/O3BYF/src/stream.jl:794
  [2] callprocess(stream::TranscodingStreams.TranscodingStream{ZstdDecompressor, IOStream}, inbuf::TranscodingStreams.Buffer, outbuf::TranscodingStreams.Buffer)
    @ TranscodingStreams ~/.julia/packages/TranscodingStreams/O3BYF/src/stream.jl:707
  [3] fillbuffer(stream::TranscodingStreams.TranscodingStream{ZstdDecompressor, IOStream}; eager::Bool)
    @ TranscodingStreams ~/.julia/packages/TranscodingStreams/O3BYF/src/stream.jl:624
  [4] fillbuffer
    @ ~/.julia/packages/TranscodingStreams/O3BYF/src/stream.jl:610 [inlined]
  [5] sloweof(stream::TranscodingStreams.TranscodingStream{ZstdDecompressor, IOStream})
    @ TranscodingStreams ~/.julia/packages/TranscodingStreams/O3BYF/src/stream.jl:222
  [6] eof
    @ ~/.julia/packages/TranscodingStreams/O3BYF/src/stream.jl:213 [inlined]
  [7] iterate(itr::Base.EachLine{TranscodingStreams.TranscodingStream{ZstdDecompressor, IOStream}}, state::Nothing)
    @ Base ./io.jl:1235
  [8] iterate
    @ ./io.jl:1235 [inlined]
  [9] _collect(cont::UnitRange{Int64}, itr::Base.EachLine{TranscodingStreams.TranscodingStream{ZstdDecompressor, IOStream}}, ::Base.HasEltype, isz::Base.SizeUnknown)
    @ Base ./array.jl:727
 [10] collect
    @ ./array.jl:716 [inlined]
 [11] readlines
    @ ./io.jl:708 [inlined]
 [12] read_and_decode(file_path::String)
    @ Main ~/reddit/scripts/PushshiftDumps/minimal.jl:6
 [13] top-level scope
    @ ~/reddit/scripts/PushshiftDumps/minimal.jl:12
in expression starting at /home/marcel/reddit/scripts/PushshiftDumps/minimal.jl:12

The package versions are the following, and I am working with Julia 1.11:

  [6b39b394] CodecZstd v0.8.6
  [3bb67fe8] TranscodingStreams v0.11.3

Thanks for the details. Do you have a link to an example of a file? Usually, that error means the file is somehow corrupted.

The repo with the monthly files can be found by searching for academictorrents + reddit monthly submissions (I can’t post links apparently…).

There is also a repo linked with Python scripts which work for the dumps, so the file per se is not corrupted. But I noticed that I run into an issue when I use the default zstd command in bash, so I suspect something similar happens in Julia:

RS_2023-03.zst : Decoding error (36) : Frame requires too much memory for decoding 
RS_2023-03.zst : Window size larger than maximum : 2147483648 > 134217728 

I think this means the data was compressed with the --long[=#] option in the zstd CLI tool. You can use the --long=31 flag to decode data with a 2^31-byte window size: the error above shows the frame needs a window of 2147483648 bytes (2^31), while the decoder's default maximum is 134217728 bytes (2^27).

Currently, there is no API to adjust this in CodecZstd, but you can use the CLI tool from Julia like so:

julia> using Zstd_jll

julia> run(`$(Zstd_jll.zstd()) --help`)
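For example, assuming the file is named RS_2023-03.zst (the output filename here is just illustrative), a command along these lines should decompress it to disk:

julia> run(`$(Zstd_jll.zstd()) --decompress --long=31 -o RS_2023-03.jsonl RS_2023-03.zst`)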

XRef:
Allow setting `ZSTD_d_windowLogMax` decompression parameter · Issue #79 · JuliaIO/CodecZstd.jl · GitHub
Improve error messages · Issue #80 · JuliaIO/CodecZstd.jl · GitHub

I was able to download the file from Reddit comments/submissions 2005-06 to 2024-12 - Academic Torrents

If you have a ton of hard drive space and a good mmap implementation, you can also use ChunkCodecLibZstd:

julia> using Mmap

julia> using ChunkCodecLibZstd: ChunkCodecCore, ZstdCodec

julia> encoded_data = mmap(open("RS_2023-03.zst"));

julia> decoded_size = ChunkCodecCore.try_find_decoded_size(ZstdCodec(), encoded_data)
161869704726

julia> isnothing(decoded_size) && error("unable to find the decoded size")
false

julia> dst = mmap(open("decoded.jsonl", "w+"), Vector{UInt8}, decoded_size; grow=true);

julia> ChunkCodecCore.decode!(ZstdCodec(), dst, encoded_data);
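Once decoded.jsonl is on disk, you can then stream the lines without holding them all in memory. A minimal sketch (the per-line processing is left as a comment; JSON3 is just one possible parser choice):

julia> n = 0;

julia> for line in eachline("decoded.jsonl")
           # each line is one JSON object; parse it here, e.g. with JSON3.read(line)
           n += 1
       end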

If not, you can use the zstd CLI with eachline instead of readlines to avoid using more than about 2 GB of memory:

julia> using Zstd_jll

julia> function read_and_decode(file_path)
           open(`$(Zstd_jll.zstd()) --decompress --stdout --long=31 $(file_path)`; read=true) do s
               n = Int64(0)
               for line in eachline(s; keep=true)
                   n += ncodeunits(line)
               end
               n
           end
       end
read_and_decode (generic function with 1 method)

julia> read_and_decode("RS_2023-03.zst")
161869704726

Thanks, that looks super helpful already!
I tried to get it working with chunks yesterday, but I think the command needs complete files…

I think it is possible to set the input filename to - and then feed the compressed chunks to the stdin of the running zstd process.
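A minimal sketch of that idea (untested; `chunks` stands for whatever iterable of compressed byte vectors you already have): open the zstd process with both read and write pipes, feed its stdin from a task, and read the decompressed lines from its stdout.

using Zstd_jll

function decode_chunks(chunks)
    # "-" tells zstd to read the compressed input from stdin
    open(`$(Zstd_jll.zstd()) --decompress --stdout --long=31 -`; read=true, write=true) do proc
        writer = @async begin
            for chunk in chunks
                write(proc, chunk)  # feed compressed bytes to zstd's stdin
            end
            close(proc.in)  # signal end of input so zstd can flush and exit
        end
        n = Int64(0)
        for line in eachline(proc; keep=true)
            n += ncodeunits(line)  # replace with real per-line processing
        end
        wait(writer)
        n
    end
end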