Avoid disk write when reading gzipped file from s3

ilanggear · March 10, 2023, 2:21am

I have a goal similar to the thread "Retrieve data from Amazon S3?", but my data is gzipped text, not TIFF.

I want to read many 130M gzipped text files from S3 one by one, decompress each file, extract a regex match (to be stored later), and then discard the S3 object without ever writing to disk.

So I have this attempt:
for line in ZipFile.Reader(load(Stream(format"GZ", IOBuffer(s3obj))))

but I’m getting:

ERROR: LoadError: No applicable_loaders found for GZ

I also tried this variant:
for fname in ZipFile.Reader(FileIO.load(IOBuffer(obj)))

with this result:

ERROR: LoadError: ArgumentError: Unrecognized RDA format

(the message is followed by a dump of raw binary bytes)

Is there a way to do what I want?

ericphanson · March 10, 2023, 2:35am

I would try GzipDecompressorStream from CodecZlib.jl (JuliaIO/CodecZlib.jl, zlib codecs for TranscodingStreams.jl) instead of ZipFile.
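
A minimal sketch of that suggestion, for illustration only. ZipFile reads ZIP archives, whereas a .gz file is a single gzip stream, so the decompressor is wrapped directly around an in-memory buffer. It assumes the S3 object body is already available as a Vector{UInt8} (how it is fetched is not shown in this thread); the helper name collect_matches, the sample text, and the regex are hypothetical, and the sample bytes are gzipped in memory as a stand-in for the S3 data.

```julia
using CodecZlib  # provides GzipDecompressorStream and GzipCompressor

# Hypothetical helper: scan gzipped bytes held in memory, line by line,
# and collect the first regex match on each line. Nothing touches disk.
function collect_matches(gz_bytes::Vector{UInt8}, pattern::Regex)
    matches = String[]
    stream = GzipDecompressorStream(IOBuffer(gz_bytes))
    try
        for line in eachline(stream)
            m = match(pattern, line)
            m === nothing || push!(matches, m.match)
        end
    finally
        close(stream)  # also closes the wrapped IOBuffer
    end
    return matches
end

# Stand-in for an S3 object body (assumption): gzip some text in memory.
sample = transcode(GzipCompressor, Vector{UInt8}("id=1 foo\nno match here\nid=22 bar\n"))
found = collect_matches(sample, r"id=\d+")
```

Because the stream is read lazily with eachline, only the compressed bytes plus a small decompression buffer are held in memory at once; the object can be dropped as soon as the loop finishes.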

ilanggear · March 10, 2023, 3:03am

That was the missing link. Thanks!!
