GzipDecompressionStream compared to GZip.jl?


#1

How does GzipDecompressionStream compare to GZip.jl for line-by-line reading? Also, thank you for making this very useful package!


[ANN] TranscodingStreams.jl - new APIs to zlib, bzip2, xz, zstd and more!
#2

Both packages support line-based IO operations but there are some differences.

GzipDecompressionStream can wrap any I/O stream, while GZipStream can only read
data from a file. For example, if your gzip data comes from another process
via a pipe, you can use GzipDecompressionStream but not GZipStream.

julia> using CodecZlib

julia> pipe, proc = open(`cat movies.csv.gz`)
(Pipe(RawFD(-1) closed => RawFD(17) open, 0 bytes waiting), Process(`cat movies.csv.gz`, ProcessRunning))

julia> stream = GzipDecompressionStream(pipe)
TranscodingStreams.TranscodingStream{CodecZlib.GzipDecompression,Pipe}(<state=idle>)

julia> readline(stream)
"\"\",\"title\",\"year\",\"length\",\"budget\",\"rating\",\"votes\",\"r1\",\"r2\",\"r3\",\"r4\",\"r5\",\"r6\",\"r7\",\"r8\",\"r9\",\"r10\",\"mpaa\",\"Action\",\"Animation\",\"Comedy\",\"Drama\",\"Documentary\",\"Romance\",\"Short\""

julia> readline(stream)
"\"1\",\"\$\",1971,121,NA,6.4,348,4.5,4.5,4.5,4.5,14.5,24.5,24.5,14.5,4.5,4.5,\"\",0,0,1,1,0,0,0"

julia> close(stream)

julia> pipe
Pipe(RawFD(-1) closed => RawFD(-1) closed, 278528 bytes waiting)

julia> proc
Process(`cat movies.csv.gz`, ProcessSignaled(13))
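
The same wrapping works for in-memory data. Here is a minimal sketch, assuming a recent CodecZlib release, where the stream type is spelled GzipDecompressorStream and the codec GzipCompressor (the GzipDecompressionStream name used in this thread is the older spelling). The sample text is made up for illustration.

```julia
using CodecZlib

# Build a small gzip payload in memory so the example needs no file on disk.
# The CSV content is invented sample data.
text = "title,year\nA,1971\nB,1984\n"
compressed = transcode(GzipCompressor, Vector{UInt8}(text))

# Wrap an in-memory buffer instead of a file or a pipe.
stream = GzipDecompressorStream(IOBuffer(compressed))

header = readline(stream)         # first decompressed line
rows = collect(eachline(stream))  # remaining lines, without trailing newlines

close(stream)                     # also closes the wrapped IOBuffer
```

Anything that behaves as an `IO` should work in place of the `IOBuffer` here, which is the point of the stream-wrapping design.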

In terms of performance, on my local benchmark, CodecZlib.jl is about seven
times faster than GZip.jl (median 39 ms versus 287 ms) when reading
movies.csv.gz line by line:

shell> ls -lh
total 4896
-rw-r--r--  1 kenta  staff    52B Aug  7  2016 1000x2.csv.gz
-rw-r--r--  1 kenta  staff   971K Aug  7  2016 movies.csv.bz2
-rw-r--r--  1 kenta  staff   1.4M Aug  7  2016 movies.csv.gz

shell> gzip -cd movies.csv.gz | wc -l
   58789

julia> using BenchmarkTools

julia> using GZip

julia> @benchmark foreach(identity, eachline(GZip.gzopen("movies.csv.gz")))
BenchmarkTools.Trial:
  memory estimate:  37.84 MiB
  allocs estimate:  530031
  --------------
  minimum time:     279.652 ms (1.81% GC)
  median time:      287.000 ms (1.68% GC)
  mean time:        287.984 ms (1.50% GC)
  maximum time:     305.171 ms (0.91% GC)
  --------------
  samples:          18
  evals/sample:     1

julia> using CodecZlib

julia> @benchmark foreach(identity, eachline(GzipDecompressionStream(open("movies.csv.gz"))))
BenchmarkTools.Trial:
  memory estimate:  24.55 MiB
  allocs estimate:  236274
  --------------
  minimum time:     38.299 ms (3.82% GC)
  median time:      39.422 ms (4.37% GC)
  mean time:        40.007 ms (4.87% GC)
  maximum time:     47.506 ms (4.79% GC)
  --------------
  samples:          125
  evals/sample:     1
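
The benchmark expressions above open the file on every evaluation and do not explicitly close it, which is fine for a quick timing but not for long-running code. As a rough sketch of the same line-by-line read with cleanup, assuming a recent CodecZlib release (where the type is named GzipDecompressorStream), something like the following should work. `count_lines` is a hypothetical helper name, not part of either package.

```julia
using CodecZlib

# Hypothetical helper: count the lines in a gzip-compressed file,
# similar in spirit to `gzip -cd file.gz | wc -l`.
function count_lines(path::AbstractString)
    stream = GzipDecompressorStream(open(path))
    try
        # eachline decompresses lazily, one line at a time.
        return count(_ -> true, eachline(stream))
    finally
        close(stream)  # closing the stream also closes the wrapped file
    end
end
```

Calling `count_lines("movies.csv.gz")` on the file used in the benchmark would then be expected to agree with the `wc -l` figure shown earlier.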

#3

Yeah, this is absolutely great. I am happily using XzDecompressionStream in my project, and it’s working very well.