How to read files from a compressed file (zip/gz) lazily?

jling · January 14, 2021, 1:28am

I know how to use TranscodingStreams to read one file without fully decompressing it first. But what if the target.tar.gz contains multiple directories/files?

pixel27 · January 14, 2021, 5:44pm

For zip files I would look at https://github.com/fhs/ZipFile.jl/blob/master/src/ZipFile.jl. Digging into the source, there is an example:

julia> using ZipFile
julia> r = ZipFile.Reader("/tmp/example.zip");
julia> for f in r.files
          println("Filename: $(f.name)")
          write(stdout, read(f, String));
       end
julia> close(r)

It appears that each “f” returned by r.files is a stream to read that file.

My quick search didn’t turn up anything for tar other than https://github.com/JuliaIO/Tar.jl, which doesn’t give you a stream to read a file. It extracts a file or files to a directory, and you can read the contents from there.
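To make the "extract, then read" route concrete, here is a minimal sketch, assuming a Julia version where Tar can be loaded with `using Tar` (it ships as a standard library in recent releases). The file names and temporary paths are made up for the example. `Tar.list` reports the headers without extracting anything, and `Tar.extract` accepts a predicate so that only the matching entries are written to disk.

```julia
using Tar

# Build a small throwaway tarball so the example is self-contained.
src = mktempdir()
write(joinpath(src, "a.txt"), "alpha\n")
write(joinpath(src, "b.txt"), "beta\n")
tarball = joinpath(mktempdir(), "example.tar")
Tar.create(src, tarball)

# List the entries without extracting anything.
paths = [hdr.path for hdr in Tar.list(tarball)]

# Extract only the entry we care about; the predicate is called with each header.
# With no target directory given, Tar.extract creates a temporary one and returns it.
dest = Tar.extract(hdr -> hdr.path == "b.txt", tarball)
contents = read(joinpath(dest, "b.txt"), String)
```

Both functions also accept an IO in place of a path, so for a .tar.gz one option is to pass a decompressing stream (for instance CodecZlib's GzipDecompressorStream, which is an assumption here and not part of the sketch). The archive is still read from the start either way.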

From what I remember about the tar format, it’s kind of annoying from an extraction point of view. It’s basically:

file1 header
file1 contents
file2 header
file2 contents
file3 header
file3 contents
file4 header
file4 contents

Which is great when the file is written to tape (what it was designed for), not so great when you can randomly access the file. This means that if you want to extract file4, you need to read the header for file1, figure out how many bytes to skip, skip them, read the header for file2, figure out how many bytes to skip, skip them, rinse and repeat. Which can turn into a lot of IO operations. If the tar is compressed it gets even worse, since you still have to perform the decompression for files 1, 2, and 3 in order to skip them.
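The header-hopping described above can be sketched in plain Julia with no packages. This is only an illustration: `fake_header`, `fake_tar`, and `read_entry` are hypothetical helpers, and the in-memory archive fills in just the two header fields the walk needs (name and octal size in a 512-byte block), so real tar tools would reject it for lacking a checksum and the other fields.

```julia
# Build one 512-byte header block holding only the fields this sketch reads:
# the entry name (bytes 1-100) and the size as octal text (bytes 125-136).
function fake_header(name::String, nbytes::Int)
    block = zeros(UInt8, 512)
    copyto!(block, 1, codeunits(name), 1, ncodeunits(name))
    octal = string(nbytes, base = 8, pad = 11)
    copyto!(block, 125, codeunits(octal), 1, 11)
    return block
end

# Assemble an uncompressed archive in memory: header, contents, zero padding
# up to the next 512-byte boundary, then two all-zero blocks as the end marker.
function fake_tar(entries)
    io = IOBuffer()
    for (name, data) in entries
        write(io, fake_header(name, ncodeunits(data)))
        write(io, data)
        write(io, zeros(UInt8, (512 - ncodeunits(data) % 512) % 512))
    end
    write(io, zeros(UInt8, 1024))
    return seekstart(io)
end

# Walk the headers until `target` turns up, skipping every other entry's contents.
# Returns the contents (or nothing) and the number of entries that were skipped.
function read_entry(io::IO, target::String)
    skipped = 0
    while !eof(io)
        header = read(io, 512)
        all(iszero, header) && break                  # end-of-archive marker
        name = rstrip(String(header[1:100]), '\0')
        nbytes = parse(Int, strip(String(header[125:136]), ['\0', ' ']), base = 8)
        if name == target
            return String(read(io, nbytes)), skipped
        end
        skip(io, cld(nbytes, 512) * 512)              # jump over the padded contents
        skipped += 1
    end
    return nothing, skipped
end

tar = fake_tar(["file1" => "one", "file2" => "two", "file3" => "three", "file4" => "four"])
contents, skipped = read_entry(tar, "file4")
```

Here `contents` is "four" and `skipped` is 3. On a seekable file the `skip` is cheap, but on a gzip stream skipping still means decompressing every byte that is passed over.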

jling · January 14, 2021, 5:48pm

Yes, I ended up trying to use a stream and manually parse the headers. It’s extremely slow because Julia’s readline and string(s, newone) both allocate. Sigh, I gave up yesterday.
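One part of that cost can usually be avoided: calling string(s, newone) in a loop copies everything accumulated so far on every step. A rough sketch of the alternative, assuming the goal is to collect selected lines into one string, is to print into a single IOBuffer and build the String once at the end. `collect_lines` and the empty-line filter are hypothetical stand-ins for the real parsing.

```julia
# Append kept lines to one growing buffer instead of rebuilding a String each time.
function collect_lines(io::IO)
    out = IOBuffer()
    nlines = 0
    for line in eachline(io)
        isempty(line) && continue   # hypothetical filter; real parsing would go here
        println(out, line)
        nlines += 1
    end
    return String(take!(out)), nlines
end

# Stand-in for a decompressed stream.
input = IOBuffer("a\n\nb\nc\n")
text, nlines = collect_lines(input)
```

eachline still allocates one String per line, so this only removes the repeated copying from concatenation, not the per-line allocations.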

For context, I was parsing a 200 MB metadata.tar.gz containing just text (about 20 GB uncompressed) of video metadata.

Skoffer · January 14, 2021, 6:09pm

For Tar you can also try to use TarIterators.jl
