GzipDecompressionStream compared to GZip.jl?


How does GzipDecompressionStream compare to GZip.jl for line-by-line reading? Also, thank you for making this very useful package!

[ANN] TranscodingStreams.jl - new APIs to zlib, bzip2, xz, zstd and more!

Both packages support line-based I/O operations, but there are some differences.

GzipDecompressionStream can wrap any I/O stream, while GZipStream can only read
data from a file. For example, if your gzip data comes from another process
via a pipe, you can use GzipDecompressionStream but not GZipStream.

julia> using CodecZlib

julia> pipe, proc = open(`cat movies.csv.gz`)
(Pipe(RawFD(-1) closed => RawFD(17) open, 0 bytes waiting), Process(`cat movies.csv.gz`, ProcessRunning))

julia> stream = GzipDecompressionStream(pipe)

julia> readline(stream)

julia> readline(stream)

julia> close(stream)

julia> pipe
Pipe(RawFD(-1) closed => RawFD(-1) closed, 278528 bytes waiting)

julia> proc
Process(`cat movies.csv.gz`, ProcessSignaled(13))
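
The same idea works for any IO object, not only pipes. Below is a minimal sketch that wraps an in-memory IOBuffer and reads it line by line. It assumes a recent CodecZlib release, where, as far as I know, the stream type is exported as GzipDecompressorStream rather than the GzipDecompressionStream name used in this thread; the sample data is made up for illustration.

```julia
using CodecZlib  # assumption: a recent release exporting GzipCompressor / GzipDecompressorStream

# Build a small gzip payload in memory (hypothetical sample data).
plain = "id,title\n1,Alien\n2,Brazil\n"
compressed = transcode(GzipCompressor, Vector{UInt8}(plain))

# Wrap an arbitrary IO object, here an IOBuffer instead of a file or pipe.
stream = GzipDecompressorStream(IOBuffer(compressed))

# eachline decompresses lazily and strips the trailing newline from each line.
lines = collect(eachline(stream))

# Closing the decompressor stream also closes the wrapped IOBuffer.
close(stream)
```

Swapping the IOBuffer for an open file, a socket, or a process handle should not require any other change, since the wrapper only relies on the generic IO interface.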

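In terms of performance, in my local benchmark CodecZlib.jl is significantly
faster than GZip.jl, roughly 7x at the median (39.4 ms versus 287.0 ms), and it
allocates less memory (24.55 MiB versus 37.84 MiB):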

shell> ls -lh
total 4896
-rw-r--r--  1 kenta  staff    52B Aug  7  2016 1000x2.csv.gz
-rw-r--r--  1 kenta  staff   971K Aug  7  2016 movies.csv.bz2
-rw-r--r--  1 kenta  staff   1.4M Aug  7  2016 movies.csv.gz

shell> gzip -cd movies.csv.gz | wc -l

julia> using BenchmarkTools

julia> using GZip

julia> @benchmark foreach(identity, eachline(GZip.gzopen("movies.csv.gz")))
  memory estimate:  37.84 MiB
  allocs estimate:  530031
  minimum time:     279.652 ms (1.81% GC)
  median time:      287.000 ms (1.68% GC)
  mean time:        287.984 ms (1.50% GC)
  maximum time:     305.171 ms (0.91% GC)
  samples:          18
  evals/sample:     1

julia> using CodecZlib

julia> @benchmark foreach(identity, eachline(GzipDecompressionStream(open("movies.csv.gz"))))
  memory estimate:  24.55 MiB
  allocs estimate:  236274
  minimum time:     38.299 ms (3.82% GC)
  median time:      39.422 ms (4.37% GC)
  mean time:        40.007 ms (4.87% GC)
  maximum time:     47.506 ms (4.79% GC)
  samples:          125
  evals/sample:     1


Yeah, this is absolutely great. I am happily using XzDecompressionStream in my project, and it’s working very well.