Read a gzip'd text file line by line using Base.readline(::IO)

question

#1

I wanted to read a gzip’d text file line by line. There are several options: 1) call gzgets() from zlib; 2) implement my own buffering; 3) reuse readline(::IO). I would like to try 3). This is the code I have so far:

mutable struct GzFile <: IO
	fp::Ptr{Cvoid}

	function GzFile(fn::String, mode = "r")
		x = ccall((:gzopen, "libz"), Ptr{Cvoid}, (Cstring, Cstring), fn, mode)
		x == C_NULL ? nothing : new(x)
	end
end

function eof(fp::GzFile)
	ret = ccall((:gzeof, "libz"), Cint, (Ptr{Cvoid},), fp.fp)
	ret != 0 ? true : false
end

function main(args)
	if length(args) == 0 return end
	fp = GzFile(args[1])
	s = readline(fp)
end

main(ARGS)

When I run this script with julia script.jl file.txt, I get the following error:

ERROR: LoadError: MethodError: no method matching eof(::GzFile)

even though I have defined eof(). What is the cause of this error, and how do I fix it? What other methods do I need to implement to get readline(::IO) working? Or is 3) a sensible approach at all? Thanks in advance.


#2

Did you get a warning about eof already being defined in Base? My first thought would be that you’d have to import the method to extend it, though that is the rule inside modules; I don’t know how it behaves in scripts.


#3

Answering my own question. The right solution is to extend Base.read and Base.eof, qualifying the names so the definitions add methods to the Base functions instead of creating new local ones:

function Base.read(fp::GzFile, ::Type{UInt8})::UInt8
	# gzgetc() returns the byte as a Cint, or -1 at end of file; eof()
	# must be checked first, as UInt8(-1) would throw an InexactError.
	UInt8(ccall((:gzgetc, "libz"), Cint, (Ptr{Cvoid},), fp.fp))
end

function Base.eof(fp::GzFile)
	ret = ccall((:gzeof, "libz"), Cint, (Ptr{Cvoid},), fp.fp)
	ret != 0 ? true : false
end
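
These two methods are in fact all that readline(::IO) and eachline need. As a self-contained illustration of the same idea (no libz involved), here is a toy ByteStream type (a made-up name for this sketch) that wraps a byte vector, implements only Base.eof and Base.read(io, UInt8), and gets readline for free:

```julia
# Toy in-memory stream: implementing Base.eof and Base.read(io, UInt8)
# is enough for Base.readline/eachline to work on a custom IO subtype.
mutable struct ByteStream <: IO
	data::Vector{UInt8}
	pos::Int
end
ByteStream(s::String) = ByteStream(Vector{UInt8}(codeunits(s)), 1)

Base.eof(io::ByteStream) = io.pos > length(io.data)

function Base.read(io::ByteStream, ::Type{UInt8})
	b = io.data[io.pos]
	io.pos += 1
	return b
end

io = ByteStream("foo\nbar\n")
println(readline(io))  # prints "foo"
println(readline(io))  # prints "bar"
```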

#4

An even simpler solution is to use the CodecZlib and TranscodingStreams packages directly. They let you open the gzip’d file and use normal I/O operations such as readline, with no need to reimplement this functionality.


#5

This is especially true because directly ccalling libz will fail on systems where no such dynamic library is installed or it’s installed under a different name. CodecZlib.jl does exactly what you want and uses BinaryProvider to guarantee availability of libz.


#6

This is more for learning. I have already tried CodecZlib. However, it is too slow when used in the form

for line in eachline(GzipDecompressorStream(open(filename)))
end

Probably this is the fault of Base.eachline.

PS: another problem with CodecZlib, as I understand, is that it doesn’t seamlessly open plain text and gzip’d text.
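
For the record, zlib’s gzopen()/gzread() get this transparent behavior by sniffing the two gzip magic bytes (0x1f 0x8b) at the start of the stream. Below is a minimal sketch of the same check in plain Julia, with a hypothetical is_gzip helper; it assumes a seekable stream, which is exactly why a pipe is harder (the peeked bytes would have to be buffered and replayed):

```julia
# Hypothetical helper: decide whether a stream is gzip'd by peeking at
# the two-byte magic number 0x1f 0x8b, then rewinding. Assumes the
# stream is seekable; a pipe would need the peeked bytes buffered.
function is_gzip(io::IO)
	magic = read(io, 2)  # reads up to 2 bytes
	seekstart(io)        # rewind so the caller sees the whole stream
	return length(magic) == 2 && magic[1] == 0x1f && magic[2] == 0x8b
end

is_gzip(IOBuffer([0x1f, 0x8b, 0x08]))  # true
is_gzip(IOBuffer("plain text"))        # false
```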


#7

Can you give me an example that makes your code slow with CodecZlib.jl? I have carefully tuned its performance, and I think there is no measurable performance difference between CodecZlib.jl and the gzip command.

Also, you can use the Noop codec defined in TranscodingStreams.jl to seamlessly handle compressed and uncompressed data (see https://bicycle1885.github.io/TranscodingStreams.jl/latest/examples.html#Use-a-noop-codec-1).

EDIT: Ah, you may be right. I quickly benchmarked CodecZlib.jl on Julia 0.7 and I found it is slower than I expected. I’ll take a look tomorrow.


#8

This is the full script, run with Julia 1.0:

using Profile
using CodecZlib

function main(args)
	sum, nl = 0, 0
	for line in eachline(GzipDecompressorStream(open(args[1])))
		sum += lastindex(line)
		nl += 1
	end
	println("$sum\t$nl")
end

@profile main(ARGS)
Profile.print()

The bottleneck is the following version of readuntil() around io.jl:650, from v1.0:

function readuntil(s::IO, delim::T; keep::Bool=false) where T
    out = (T === UInt8 ? StringVector(0) : Vector{T}())
    while !eof(s)
        c = read(s, T)
        if c == delim
            keep && push!(out, c)
            break
        end
        push!(out, c)
    end
    return out
end

Such single-byte operations are going to be slow.

EDIT: The Noop codec is not good enough. When we read from a Unix pipe, we can’t tell the file type from its name; gzread() in zlib handles this case.


#9

Yes, I reached the same conclusion that readuntil is slow because it reads data byte by byte. I didn’t notice the slowdown earlier because Julia 0.7 introduced the keep keyword argument, which made TranscodingStreams.jl fall back on the default readuntil implementation (which is very slow!). I’ve fixed that problem with this change (https://github.com/bicycle1885/TranscodingStreams.jl/pull/63) and found that reading text line by line is now 4-5 times faster.


#10

Yes. Your readuntil() is much better. The current Julia IO interface requires users to implement Base.eof(::IO) and Base.read(::IO, ::Type{UInt8}) and then reads everything byte by byte. This is known to be very slow. The right solution is to require a single readbytes!(::IO, ::Vector{UInt8}) method and read data in chunks; Go and several other languages do this. In JuliaLang PR #16656 you suggested something similar. Unfortunately, it wasn’t followed up. The Julia IO stack needs a revamp to be performant.
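
To make the contrast concrete, here is a minimal sketch of that chunked style using only Base: pull large blocks with readbytes!() and scan for newlines in memory, instead of issuing one read(io, UInt8) call per byte. The function name chunked_lines is made up for this example:

```julia
# Sketch of chunked line reading: fill a buffer with readbytes!() and
# split on newlines in memory, carrying any partial line in `pending`.
function chunked_lines(io::IO; bufsize::Int = 8192)
	lines = String[]
	buf = Vector{UInt8}(undef, bufsize)
	pending = UInt8[]  # bytes of an unfinished line from the last chunk
	while !eof(io)
		n = readbytes!(io, buf, bufsize)
		chunk = view(buf, 1:n)
		start = 1
		while true
			nl = findnext(==(0x0a), chunk, start)
			nl === nothing && break
			append!(pending, view(chunk, start:nl-1))
			push!(lines, String(copy(pending)))
			empty!(pending)
			start = nl + 1
		end
		append!(pending, view(chunk, start:n))
	end
	isempty(pending) || push!(lines, String(copy(pending)))
	return lines
end

chunked_lines(IOBuffer("a\nbb\nccc"))  # ["a", "bb", "ccc"]
```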