Hey there, I am currently trying to work with dumps of Reddit data, which come as a lot of compressed JSON in the zst format (e.g. 11 GB for one month of submissions).
When I try to read lines from that data I get an unspecified zstd error, probably because the file is too large. The code also breaks if I read a given number of bytes from the stream. Additionally, I tried using mmap and processing chunks manually, but I think that breaks the codec when I call transcode.
What would be a good approach to handling decompression of very large files?
Can you give an example of what code you are running and what the error looks like? You can enclose code blocks in triple-backticks ``` to get monospace font.
Also, what version of the packages are you using? (`pkg> st --manifest TranscodingStreams CodecZstd`)
In the minimal version I am basically running the example from the documentation. I also tried it with a small file, and there it works as expected.
```julia
using CodecZstd

function read_and_decode(file_path)
    s = open(file_path)
    reader = ZstdDecompressorStream(s; bufsize=2^31)
    for line in readlines(reader)
        println(line)
    end
    close(s)
end

read_and_decode(<path to large file>)
```
The repo with the monthly files can be found by searching for academictorrents + reddit monthly submissions (I can’t post links apparently…).
There is also a repo linked with Python scripts that work on the dumps, so the file per se is not corrupted. But I noticed that I run into an issue when I use the default zstd command in bash, so I suspect something similar happens in Julia:
```
RS_2023-03.zst : Decoding error (36) : Frame requires too much memory for decoding
RS_2023-03.zst : Window size larger than maximum : 2147483648 > 134217728
```

I think this means the data was compressed with the `--long[=#]` option of the zstd CLI tool: the frame declares a 2^31-byte (2 GiB) window, while the decoder's default maximum is 2^27 bytes (128 MiB). You can use the `--long=31` flag to decode data with a 2^31-byte window size.
If you have a ton of hard drive space and a good mmap, you can also use ChunkCodecLibZstd:
```julia
julia> using Mmap

julia> using ChunkCodecLibZstd: ChunkCodecCore, ZstdCodec

julia> encoded_data = mmap(open("RS_2023-03.zst"));

julia> decoded_size = ChunkCodecCore.try_find_decoded_size(ZstdCodec(), encoded_data)
161869704726

julia> isnothing(decoded_size) && error("unable to find the decoded size")
false

julia> dst = mmap(open("decoded.jsonl", "w+"), Vector{UInt8}, decoded_size; grow=true);

julia> ChunkCodecCore.decode!(ZstdCodec(), dst, encoded_data);
```
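Once `decode!` finishes, `decoded.jsonl` holds the full ~160 GB of newline-delimited JSON on disk, and you can stream it line by line without loading it into RAM. A minimal sketch (the `count_lines` name and the per-line processing are just illustrative):

```julia
# Stream the decoded newline-delimited JSON from disk.
# `eachline` keeps only the current line in memory, so this stays
# cheap even for a file much larger than RAM.
function count_lines(path)
    open(path) do io
        n = 0
        for line in eachline(io)
            # parse/process one submission's JSON object per line here
            n += 1
        end
        n
    end
end
```

Each line of these dumps is a single JSON object, so whatever parser you use only ever sees one record at a time.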
If not, you can use the zstd CLI with `eachline` instead of `readlines` to avoid using more than about 2 GB of memory:
```julia
julia> using Zstd_jll

julia> function read_and_decode(file_path)
           open(`$(Zstd_jll.zstd()) --decompress --stdout --long=31 $(file_path)`; read=true) do s
               n = Int64(0)
               for line in eachline(s; keep=true)
                   n += ncodeunits(line)
               end
               n
           end
       end
read_and_decode (generic function with 1 method)

julia> read_and_decode("RS_2023-03.zst")
161869704726
```