Avoid disk write when reading gzipped file from s3

I have a goal similar to the thread "Retrieve data from Amazon S3?", but my data is gzipped text, not TIFF.

I want to read many 130M gzipped text files from S3 one by one, decompress each file, extract a regex match (to be stored later), and then discard the S3 object without ever writing to disk.

So I have this attempt:
for line in ZipFile.Reader(load(Stream(format"GZ", IOBuffer(s3obj))))
but I’m getting:
ERROR: LoadError: No applicable_loaders found for GZ

I also tried this variant:
for fname in ZipFile.Reader(FileIO.load(IOBuffer(obj)))
with this result:
ERROR: LoadError: ArgumentError: Unrecognized RDA format
(the message is followed by a dump of raw binary bytes, apparently the still-compressed file contents)

Is there a way to do what I want?

I would try GzipDecompressorStream from CodecZlib.jl (JuliaIO/CodecZlib.jl on GitHub: zlib codecs for TranscodingStreams.jl) instead of ZipFile. ZipFile is meant for .zip archives, whereas your files are gzip streams.
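
For anyone landing here later, a minimal sketch of that approach follows. It assumes the S3 object is already in memory as a Vector{UInt8}; since I don't know which S3 client you use, the example fakes that with bytes compressed on the spot. The helper name extract_matches, the sample text, and the regex are all made up for illustration.

```julia
using CodecZlib  # provides GzipCompressor and GzipDecompressorStream

# Stand-in for the bytes of one S3 object (hypothetical sample data).
# In practice `s3obj` would be the Vector{UInt8} returned by your S3 client.
s3obj = transcode(GzipCompressor, Vector{UInt8}("id=17 ok\nno match here\nid=42 ok\n"))

# Hypothetical helper: decompress the gzip bytes in memory, line by line,
# and collect the matched text from every line that matches `pattern`.
function extract_matches(gzbytes::Vector{UInt8}, pattern::Regex)
    found = String[]
    stream = GzipDecompressorStream(IOBuffer(gzbytes))
    try
        for line in eachline(stream)
            m = match(pattern, line)
            m === nothing || push!(found, m.match)
        end
    finally
        close(stream)  # also closes the wrapped IOBuffer
    end
    return found
end

ids = extract_matches(s3obj, r"id=\d+")
```

Nothing here touches the disk: the compressed bytes sit in an IOBuffer and the decompressor streams lines out of it. Once you are done with one object, drop the reference to its bytes and move on to the next.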

That was the missing link. Thanks!!
