How does the GzipDecompressionStream compare to GZip.jl for line by line reading? Also, thank you for making this very useful package!
Both packages support line-based I/O operations, but there are some differences.
GzipDecompressionStream can wrap any I/O stream, while GZipStream can only read data from a file. For example, if your gzip data comes from another process via a pipe, you can use GzipDecompressionStream but cannot use GZipStream.
julia> using CodecZlib
julia> pipe, proc = open(`cat movies.csv.gz`)
(Pipe(RawFD(-1) closed => RawFD(17) open, 0 bytes waiting), Process(`cat movies.csv.gz`, ProcessRunning))
julia> stream = GzipDecompressionStream(pipe)
TranscodingStreams.TranscodingStream{CodecZlib.GzipDecompression,Pipe}(<state=idle>)
julia> readline(stream)
"\"\",\"title\",\"year\",\"length\",\"budget\",\"rating\",\"votes\",\"r1\",\"r2\",\"r3\",\"r4\",\"r5\",\"r6\",\"r7\",\"r8\",\"r9\",\"r10\",\"mpaa\",\"Action\",\"Animation\",\"Comedy\",\"Drama\",\"Documentary\",\"Romance\",\"Short\""
julia> readline(stream)
"\"1\",\"\$\",1971,121,NA,6.4,348,4.5,4.5,4.5,4.5,14.5,24.5,24.5,14.5,4.5,4.5,\"\",0,0,1,1,0,0,0"
julia> close(stream)
julia> pipe
Pipe(RawFD(-1) closed => RawFD(-1) closed, 278528 bytes waiting)
julia> proc
Process(`cat movies.csv.gz`, ProcessSignaled(13))
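The same wrapping works for any other `IO` object, not just pipes. Here is a minimal sketch using an in-memory `IOBuffer`, assuming a recent CodecZlib.jl release in which the types are spelled `GzipCompressor` and `GzipDecompressorStream` (the session above uses the older `GzipDecompressionStream` name). The sample text is made up for illustration.

```julia
using CodecZlib

# Compress some sample text in memory (hypothetical data, for illustration only).
text = "a,b\n1,2\n3,4\n"
compressed = transcode(GzipCompressor, Vector{UInt8}(text))

# Wrap an in-memory IOBuffer instead of a file or a pipe.
stream = GzipDecompressorStream(IOBuffer(compressed))

# eachline strips the trailing newline from each line by default.
lines = collect(eachline(stream))
close(stream)
```

Because the decompression stream only needs an `IO` to read from, the line-reading code stays the same whatever the source of the compressed bytes is.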
In terms of performance, in my local benchmark CodecZlib.jl is about 7 times faster than GZip.jl (median 39.4 ms versus 287.0 ms to read the 58,789 lines of movies.csv.gz):
shell> ls -lh
total 4896
-rw-r--r-- 1 kenta staff 52B Aug 7 2016 1000x2.csv.gz
-rw-r--r-- 1 kenta staff 971K Aug 7 2016 movies.csv.bz2
-rw-r--r-- 1 kenta staff 1.4M Aug 7 2016 movies.csv.gz
shell> gzip -cd movies.csv.gz | wc -l
58789
julia> using BenchmarkTools
julia> using GZip
julia> @benchmark foreach(identity, eachline(GZip.gzopen("movies.csv.gz")))
BenchmarkTools.Trial:
memory estimate: 37.84 MiB
allocs estimate: 530031
--------------
minimum time: 279.652 ms (1.81% GC)
median time: 287.000 ms (1.68% GC)
mean time: 287.984 ms (1.50% GC)
maximum time: 305.171 ms (0.91% GC)
--------------
samples: 18
evals/sample: 1
julia> using CodecZlib
julia> @benchmark foreach(identity, eachline(GzipDecompressionStream(open("movies.csv.gz"))))
BenchmarkTools.Trial:
memory estimate: 24.55 MiB
allocs estimate: 236274
--------------
minimum time: 38.299 ms (3.82% GC)
median time: 39.422 ms (4.37% GC)
mean time: 40.007 ms (4.87% GC)
maximum time: 47.506 ms (4.79% GC)
--------------
samples: 125
evals/sample: 1
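Outside a benchmark you would normally close the stream once you are done with it. The following is a rough sketch of the same line-iteration pattern with explicit cleanup, again assuming the newer `GzipCompressor` / `GzipDecompressorStream` names; `count_gzip_lines` and the sample file are hypothetical and not part of either package.

```julia
using CodecZlib

# Hypothetical helper: count the lines of a gzip-compressed file.
function count_gzip_lines(path::AbstractString)
    stream = GzipDecompressorStream(open(path))
    try
        return count(_ -> true, eachline(stream))
    finally
        close(stream)  # closing the stream also closes the wrapped file
    end
end

# Build a small sample file to read back (illustrative data only).
path = tempname() * ".gz"
write(path, transcode(GzipCompressor, Vector{UInt8}("x\ny\nz\n")))
n = count_gzip_lines(path)
rm(path)
```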
Yeah, this is absolutely great. I am happily using XzDecompressionStream in my project, and it’s working very well.