Read a gzip'd text file line by line using Base.readline(::IO)

question

#1

I wanted to read a gzip’d text file line by line. There are several options: 1) call gzgets() from zlib; 2) implement my own buffering; 3) reuse readline(::IO). I would like to try 3). This is the code I have so far:

mutable struct GzFile <: IO
	fp::Ptr{Cvoid}

	function GzFile(fn::String, mode = "r")
		x = ccall((:gzopen, "libz"), Ptr{Cvoid}, (Cstring, Cstring), fn, mode)
		x == C_NULL ? nothing : new(x)
	end
end

function eof(fp::GzFile)
	ret = ccall((:gzeof, "libz"), Cint, (Ptr{Cvoid},), fp.fp)
	ret != 0 ? true : false
end

function main(args)
	if length(args) == 0 return end
	fp = GzFile(args[1])
	s = readline(fp)
end

main(ARGS)

When I run this script with julia script.jl file.txt, I get the following error:

ERROR: LoadError: MethodError: no method matching eof(::GzFile)

even though I have defined eof(). What is the cause of this error, and how do I fix it? What other methods do I need to implement to get readline(::IO) working? Or is 3) a sensible approach at all? Thanks in advance.


#2

Did you get a warning about eof already being defined in Base? My first thought would be that you’d have to import the method to extend it, though that is the rule inside modules; I don’t know how it behaves in scripts.


#3

Answering my own question. The right solution is to extend Base.read and Base.eof, qualifying the names so the definitions add methods to the Base functions instead of creating new local ones:

function Base.read(fp::GzFile, ::Type{UInt8})::UInt8
	# gzgetc() returns the byte as a Cint, or -1 at end of file; eof()
	# must be checked first, as UInt8(-1) would throw an InexactError.
	UInt8(ccall((:gzgetc, "libz"), Cint, (Ptr{Cvoid},), fp.fp))
end

function Base.eof(fp::GzFile)
	ret = ccall((:gzeof, "libz"), Cint, (Ptr{Cvoid},), fp.fp)
	ret != 0 ? true : false
end
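
These two methods are in fact all that readline(::IO) and eachline need. As a self-contained illustration of the same idea (no libz involved), here is a toy ByteStream type (a made-up name for this sketch) that wraps a byte vector, implements only Base.eof and Base.read(io, UInt8), and gets readline for free:

```julia
# Toy in-memory stream: implementing Base.eof and Base.read(io, UInt8)
# is enough for Base.readline/eachline to work on a custom IO subtype.
mutable struct ByteStream <: IO
	data::Vector{UInt8}
	pos::Int
end
ByteStream(s::String) = ByteStream(Vector{UInt8}(codeunits(s)), 1)

Base.eof(io::ByteStream) = io.pos > length(io.data)

function Base.read(io::ByteStream, ::Type{UInt8})
	b = io.data[io.pos]
	io.pos += 1
	return b
end

io = ByteStream("foo\nbar\n")
println(readline(io))  # prints "foo"
println(readline(io))  # prints "bar"
```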

#4

An even simpler solution is to use the CodecZlib and TranscodingStreams packages directly. They let you open the gzip’d file and use normal I/O operations such as readline, with no need to reimplement this functionality.


#5

This is especially true because directly ccalling libz will fail on systems where no such dynamic library is installed or it’s installed under a different name. CodecZlib.jl does exactly what you want and uses BinaryProvider to guarantee availability of libz.


#6

This is more for learning. I have already tried CodecZlib. However, it is too slow when used in the form

for line in eachline(GzipDecompressorStream(open(filename)))
end

Probably this is the fault of Base.eachline.

PS: another problem with CodecZlib, as I understand, is that it doesn’t seamlessly open plain text and gzip’d text.
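
For the record, zlib’s gzopen()/gzread() get this transparent behavior by sniffing the two gzip magic bytes (0x1f 0x8b) at the start of the stream. Below is a minimal sketch of the same check in plain Julia, with a hypothetical is_gzip helper; it assumes a seekable stream, which is exactly why a pipe is harder (the peeked bytes would have to be buffered and replayed):

```julia
# Hypothetical helper: decide whether a stream is gzip'd by peeking at
# the two-byte magic number 0x1f 0x8b, then rewinding. Assumes the
# stream is seekable; a pipe would need the peeked bytes buffered.
function is_gzip(io::IO)
	magic = read(io, 2)  # reads up to 2 bytes
	seekstart(io)        # rewind so the caller sees the whole stream
	return length(magic) == 2 && magic[1] == 0x1f && magic[2] == 0x8b
end

is_gzip(IOBuffer([0x1f, 0x8b, 0x08]))  # true
is_gzip(IOBuffer("plain text"))        # false
```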


#7

Can you give me an example that makes your code slow with CodecZlib.jl? I have carefully tuned its performance, and I think there is no measurable performance difference between CodecZlib.jl and the gzip command.

Also, you can use the Noop codec defined in TranscodingStreams.jl to seamlessly handle compressed and uncompressed data (see https://bicycle1885.github.io/TranscodingStreams.jl/latest/examples.html#Use-a-noop-codec-1).

EDIT: Ah, you may be right. I quickly benchmarked CodecZlib.jl on Julia 0.7 and I found it is slower than I expected. I’ll take a look tomorrow.


#8

This is the full script, run with Julia 1.0:

using Profile
using CodecZlib

function main(args)
	sum, nl = 0, 0
	for line in eachline(GzipDecompressorStream(open(args[1])))
		sum += lastindex(line)
		nl += 1
	end
	println("$sum\t$nl")
end

@profile main(ARGS)
Profile.print()

The bottleneck is the following version of readuntil() around io.jl:650, from v1.0:

function readuntil(s::IO, delim::T; keep::Bool=false) where T
    out = (T === UInt8 ? StringVector(0) : Vector{T}())
    while !eof(s)
        c = read(s, T)
        if c == delim
            keep && push!(out, c)
            break
        end
        push!(out, c)
    end
    return out
end

Such single-byte operations are going to be slow.

EDIT: The Noop codec is not good enough. When we read from a Unix pipe, we can’t tell the file type from its name; gzread() in zlib handles this case.


#9

Yes, I reached the same conclusion that readuntil is slow because it reads data byte by byte. I didn’t notice the slowdown earlier because Julia 0.7 introduced the keep keyword argument, which made TranscodingStreams.jl fall back on the default readuntil implementation (which is very slow!). I’ve fixed that problem with this change (https://github.com/bicycle1885/TranscodingStreams.jl/pull/63) and found that reading text line by line is now 4-5 times faster.


#10

Yes. Your readuntil() is much better. The current Julia IO interface requires users to implement Base.eof(::IO) and Base.read(::IO, ::Type{UInt8}) and then reads everything byte by byte. This is known to be very slow. The right solution is to require a single readbytes!(::IO, ::Vector{UInt8}) method and read data in chunks; Go and several other languages do this. In JuliaLang PR #16656 you suggested something similar. Unfortunately, it wasn’t followed up. The Julia IO stack needs a revamp to be performant.
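
To make the contrast concrete, here is a minimal sketch of that chunked style using only Base: pull large blocks with readbytes!() and scan for newlines in memory, instead of issuing one read(io, UInt8) call per byte. The function name chunked_lines is made up for this example:

```julia
# Sketch of chunked line reading: fill a buffer with readbytes!() and
# split on newlines in memory, carrying any partial line in `pending`.
function chunked_lines(io::IO; bufsize::Int = 8192)
	lines = String[]
	buf = Vector{UInt8}(undef, bufsize)
	pending = UInt8[]  # bytes of an unfinished line from the last chunk
	while !eof(io)
		n = readbytes!(io, buf, bufsize)
		chunk = view(buf, 1:n)
		start = 1
		while true
			nl = findnext(==(0x0a), chunk, start)
			nl === nothing && break
			append!(pending, view(chunk, start:nl-1))
			push!(lines, String(copy(pending)))
			empty!(pending)
			start = nl + 1
		end
		append!(pending, view(chunk, start:n))
	end
	isempty(pending) || push!(lines, String(copy(pending)))
	return lines
end

chunked_lines(IOBuffer("a\nbb\nccc"))  # ["a", "bb", "ccc"]
```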