GzipDecompressionStream compared to GZip.jl?

freeboson · August 18, 2017, 1:45am

How does GzipDecompressionStream compare to GZip.jl for line-by-line reading? Also, thank you for making this very useful package!

bicycle1885 · August 18, 2017, 4:26am

Both packages support line-based IO operations but there are some differences.

GzipDecompressionStream can wrap any I/O stream, while GZipStream can only read
data from a file. For example, if your gzip data comes from another process
via a pipe, you can use GzipDecompressionStream but not GZipStream.

julia> using CodecZlib

julia> pipe, proc = open(`cat movies.csv.gz`)
(Pipe(RawFD(-1) closed => RawFD(17) open, 0 bytes waiting), Process(`cat movies.csv.gz`, ProcessRunning))

julia> stream = GzipDecompressionStream(pipe)
TranscodingStreams.TranscodingStream{CodecZlib.GzipDecompression,Pipe}(<state=idle>)

julia> readline(stream)
"\"\",\"title\",\"year\",\"length\",\"budget\",\"rating\",\"votes\",\"r1\",\"r2\",\"r3\",\"r4\",\"r5\",\"r6\",\"r7\",\"r8\",\"r9\",\"r10\",\"mpaa\",\"Action\",\"Animation\",\"Comedy\",\"Drama\",\"Documentary\",\"Romance\",\"Short\""

julia> readline(stream)
"\"1\",\"\$\",1971,121,NA,6.4,348,4.5,4.5,4.5,4.5,14.5,24.5,24.5,14.5,4.5,4.5,\"\",0,0,1,1,0,0,0"

julia> close(stream)

julia> pipe
Pipe(RawFD(-1) closed => RawFD(-1) closed, 278528 bytes waiting)

julia> proc
Process(`cat movies.csv.gz`, ProcessSignaled(13))
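
The same wrapping should work for any other non-file I/O object. Here is a minimal sketch that uses an in-memory IOBuffer instead of a pipe. It assumes a recent CodecZlib.jl release, where the stream type is spelled GzipDecompressorStream (GzipDecompressionStream was the name when this thread was written), and the CSV text is stand-in data, not taken from movies.csv.gz.

```julia
using CodecZlib

# Stand-in payload (hypothetical data, not from the thread).
text = "a,b\n1,2\n3,4\n"

# Compress it in memory so that no file is involved at any point.
compressed = transcode(GzipCompressor, Vector{UInt8}(text))

# Wrap a plain IOBuffer, which is not a file, in a decompression stream.
stream = GzipDecompressorStream(IOBuffer(compressed))

# Read line by line, exactly as with the pipe-backed stream.
lines = collect(eachline(stream))

# Closing the stream also closes the wrapped IOBuffer.
close(stream)
```

Because the stream only needs an object that supports the usual read operations, the buffer here could be swapped for a socket or a process pipe without changing the reading code.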

In terms of performance, in my local benchmark CodecZlib.jl is about 7x faster
than GZip.jl at reading every line of movies.csv.gz (median 39 ms vs. 287 ms):

shell> ls -lh
total 4896
-rw-r--r--  1 kenta  staff    52B Aug  7  2016 1000x2.csv.gz
-rw-r--r--  1 kenta  staff   971K Aug  7  2016 movies.csv.bz2
-rw-r--r--  1 kenta  staff   1.4M Aug  7  2016 movies.csv.gz

shell> gzip -cd movies.csv.gz | wc -l
   58789

julia> using BenchmarkTools

julia> using GZip

julia> @benchmark foreach(identity, eachline(GZip.gzopen("movies.csv.gz")))
BenchmarkTools.Trial:
  memory estimate:  37.84 MiB
  allocs estimate:  530031
  --------------
  minimum time:     279.652 ms (1.81% GC)
  median time:      287.000 ms (1.68% GC)
  mean time:        287.984 ms (1.50% GC)
  maximum time:     305.171 ms (0.91% GC)
  --------------
  samples:          18
  evals/sample:     1

julia> using CodecZlib

julia> @benchmark foreach(identity, eachline(GzipDecompressionStream(open("movies.csv.gz"))))
BenchmarkTools.Trial:
  memory estimate:  24.55 MiB
  allocs estimate:  236274
  --------------
  minimum time:     38.299 ms (3.82% GC)
  median time:      39.422 ms (4.37% GC)
  mean time:        40.007 ms (4.87% GC)
  maximum time:     47.506 ms (4.79% GC)
  --------------
  samples:          125
  evals/sample:     1

freeboson · August 22, 2017, 2:00am

Yeah, this is absolutely great. I am happily using XzDecompressionStream in my project, and it’s working very well.
