How to read files from a compressed file (zip/gz) lazily?

I know how to use TranscodingStreams to read a single file without fully decompressing it first. But what if the target.tar.gz contains multiple directories and files?

For zip files I would look at https://github.com/fhs/ZipFile.jl/blob/master/src/ZipFile.jl. Digging into the source there is an example:

julia> r = ZipFile.Reader("/tmp/example.zip");
julia> for f in r.files
           println("Filename: $(f.name)")
           write(stdout, read(f, String));
       end
julia> close(r)

It appears that each “f” returned by r.files is a stream to read that file.

My quick search didn’t turn up anything for tar other than https://github.com/JuliaIO/Tar.jl, which doesn’t give you a stream to read a file. Instead it extracts one or more files to a directory, and you can read the contents from there.

From what I remember about the tar format, it’s kind of annoying from an extraction point of view. It’s basically:

  • file1 header
  • file1 contents
  • file2 header
  • file2 contents
  • file3 header
  • file3 contents
  • file4 header
  • file4 contents

This is great when the file is written to tape (what it was designed for), but not so great when you want random access to one file. If you want to extract file4, you need to read the header for file1, figure out how many bytes to skip, skip them, read the header for file2, figure out how many bytes to skip, skip them, and so on. That can turn into a lot of IO operations. If the tar is compressed it gets even worse, since you still have to decompress files 1, 2, and 3 in order to skip them.
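The header-by-header walk described above can be sketched in a few lines of Base Julia. This is only a rough illustration, not how Tar.jl or any other package does it: it assumes plain ustar-style headers (entry name in the first 100 bytes, size as octal text in bytes 125-136, everything padded to 512-byte blocks) and ignores long-name extensions, links, and checksums. The names `field` and `find_entry` are made up for this example.

```julia
# Minimal sketch: scan a tar stream for one entry, skipping everything before it.
# Assumes simple ustar headers; `field` and `find_entry` are hypothetical helpers.

const BLOCK = 512  # tar headers and contents are laid out in 512-byte blocks

# Pull a NUL-terminated text field out of a header block.
function field(block::Vector{UInt8}, range::UnitRange{Int})
    bytes = block[range]
    stop = findfirst(==(0x00), bytes)
    return String(stop === nothing ? bytes : bytes[1:stop-1])
end

# Return the bytes of entry `target`, or `nothing` if it is not found.
function find_entry(io::IO, target::AbstractString)
    while !eof(io)
        header = read(io, BLOCK)
        # A short read or an all-zero block marks the end of the archive.
        (length(header) < BLOCK || all(iszero, header)) && return nothing
        name = field(header, 1:100)
        nbytes = parse(Int, strip(field(header, 125:136)); base = 8)
        name == target && return read(io, nbytes)
        # Not the entry we want: skip its contents, rounded up to a full block.
        # On a decompressing stream these bytes still have to be decompressed.
        skip(io, cld(nbytes, BLOCK) * BLOCK)
    end
    return nothing
end
```

With an uncompressed archive, `open(io -> find_entry(io, "dir/file4.txt"), "example.tar")` would seek past the earlier entries cheaply (the path here is just a placeholder). Passing a gzip-decompressing stream instead should work with any `IO` that supports `skip`, but every skipped byte is still decompressed along the way, which is the cost described above.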

Yes, I ended up trying to use a stream and parse the headers manually. It’s extremely slow because Julia’s readline and string(s, newone) both allocate. Sigh, I gave up yesterday.

For context, I was parsing a metadata.tar.gz of video metadata: 200 MB of plain text compressed, around 20 GB uncompressed.

For Tar you can also try to use TarIterators.jl
