I wanted to read a gzip'd text file line by line. There are several options: 1) call `gzgets()` from zlib; 2) implement my own buffering; 3) reuse `readline(::IO)`. I would like to try 3). This is the code I got so far:
```julia
mutable struct GzFile <: IO
    fp::Ptr{Cvoid}
    function GzFile(fn::String, mode = "r")
        x = ccall((:gzopen, "libz"), Ptr{Cvoid}, (Cstring, Cstring), fn, mode)
        x == C_NULL ? nothing : new(x)
    end
end

function eof(fp::GzFile)
    ret = ccall((:gzeof, "libz"), Cint, (Ptr{Cvoid},), fp.fp)
    ret != 0 ? true : false
end

function main(args)
    if length(args) == 0 return end
    fp = GzFile(args[1])
    s = readline(fp)
end

main(ARGS)
```
When I run this script with `julia script.jl file.txt`, I got the following error:

```
ERROR: LoadError: MethodError: no method matching eof(::GzFile)
```

even though I have defined `eof()`. What is the cause of this error? How do I fix it? What other methods do I need in order to get `readline(::IO)` working? Or is 3) a sensible approach at all? Thanks in advance.
Did you get a warning about `eof` already being defined in Base? My first thought would be that you'd have to import the method to extend it, though that applies to modules; I don't know how it behaves in scripts.
Answering my own question. The right solution is to "overload"/redefine `Base.read` and `Base.eof` as follows:

```julia
function Base.read(fp::GzFile, ::Type{UInt8})::UInt8
    UInt8(ccall((:gzgetc, "libz"), Cint, (Ptr{Cvoid},), fp.fp))
end

function Base.eof(fp::GzFile)
    ccall((:gzeof, "libz"), Cint, (Ptr{Cvoid},), fp.fp) != 0
end
```
An even simpler solution is to use the CodecZlib and TranscodingStreams packages directly. They let you open the gzip file directly and use normal I/O operations such as `readline`; no need to reimplement this functionality.
This is especially true because directly `ccall`ing libz will fail on systems where no such dynamic library is installed, or where it's installed under a different name. CodecZlib.jl does exactly what you want and uses BinaryProvider to guarantee the availability of libz.
This is more for learning. I have already tried CodecZlib; however, it is too slow in the form of

```julia
for line in eachline(GzipDecompressorStream(open(filename)))
end
```

Probably this is the fault of `Base.eachline`.

PS: another problem with CodecZlib, as I understand it, is that it doesn't seamlessly open both plain text and gzip'd text.
Can you give me an example that makes your code slow with CodecZlib.jl? I carefully tune the performance, and I think there is no measurable performance difference between CodecZlib.jl and the gzip command.
Also, you can use the `Noop` codec defined in TranscodingStreams.jl to seamlessly handle compressed and uncompressed data (see https://bicycle1885.github.io/TranscodingStreams.jl/latest/examples.html#Use-a-noop-codec-1).
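As an illustrative sketch (mine, not from the linked docs): for seekable files, one can peek at the two-byte gzip magic number and pick the codec accordingly, with `NoopStream` keeping the return type uniform. The helper name `open_text` is hypothetical, and this assumes CodecZlib and TranscodingStreams are installed.

```julia
using CodecZlib            # assumed installed
using TranscodingStreams   # provides NoopStream

# Hypothetical helper: open `path` as a text stream, transparently
# decompressing if the file starts with the gzip magic bytes 0x1f 0x8b.
function open_text(path::AbstractString)
    io = open(path)
    magic = read(io, 2)    # peek at the first two bytes
    seekstart(io)          # rewind so the codec sees the whole file
    if magic == UInt8[0x1f, 0x8b]
        return GzipDecompressorStream(io)
    else
        return NoopStream(io)  # uniform TranscodingStream return type
    end
end
```

This only works for seekable input; for a non-seekable source such as a pipe, one would have to buffer the peeked bytes instead of calling `seekstart`.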
EDIT: Ah, you may be right. I quickly benchmarked CodecZlib.jl on Julia 0.7 and I found it is slower than I expected. I’ll take a look tomorrow.
This is the full script, run with Julia 1.0:

```julia
using Profile
using CodecZlib

function main(args)
    sum, nl = 0, 0
    for line in eachline(GzipDecompressorStream(open(args[1])))
        sum += lastindex(line)
        nl += 1
    end
    println("$sum\t$nl")
end

@profile main(ARGS)
Profile.print()
```
The bottleneck is the following version of `readuntil()` around `io.jl:650`, from v1.0:

```julia
function readuntil(s::IO, delim::T; keep::Bool=false) where T
    out = (T === UInt8 ? StringVector(0) : Vector{T}())
    while !eof(s)
        c = read(s, T)
        if c == delim
            keep && push!(out, c)
            break
        end
        push!(out, c)
    end
    return out
end
```
Such single-byte operations are going to be slow.
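To illustrate the difference, here is a sketch (pure Julia; the function name and buffering scheme are mine, not the thread's) of a chunked line splitter that makes one `readbytes!` call per block instead of one `read(s, UInt8)` call per byte:

```julia
# Sketch: split a stream into lines by reading fixed-size blocks with
# readbytes!, rather than issuing one read() call per byte.
function chunked_lines(io::IO; bufsize::Int = 64 * 1024)
    lines = String[]
    pending = UInt8[]                       # bytes of a partially-read line
    buf = Vector{UInt8}(undef, bufsize)
    while !eof(io)
        n = readbytes!(io, buf, bufsize)    # one call per block, not per byte
        chunk = view(buf, 1:n)
        start = 1
        while (nl = findnext(==(0x0a), chunk, start)) !== nothing
            append!(pending, view(chunk, start:nl-1))
            push!(lines, String(copy(pending)))
            empty!(pending)
            start = nl + 1
        end
        append!(pending, view(chunk, start:n))  # no newline yet; carry over
    end
    isempty(pending) || push!(lines, String(copy(pending)))  # trailing line without '\n'
    return lines
end
```

The per-byte version pays a dynamic dispatch and a bounds-checked `push!` for every byte; the chunked version amortizes both over a whole block.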
EDIT: The Noop codec is not good enough. When we read from a unix pipe, we can’t tell the file type from its name. gzread() in zlib works in this case.
Yes, I reached the same conclusion that `readuntil` is slow because it reads data byte by byte. I didn't notice the slowdown because Julia 0.7 introduced the new `keep` keyword argument, which made TranscodingStreams.jl fall back on the default `readuntil` implementation (which is very slow!). I've fixed that problem with this change (make readuntil accept the keep argument by bicycle1885 · Pull Request #63 · JuliaIO/TranscodingStreams.jl · GitHub) and found that it now becomes 4-5 times faster when reading text line by line.
Yes, your `readuntil()` is much better. The current Julia `IO` stack requires users to implement `Base.eof(::IO)` and `Base.read(::IO, ::Type{UInt8})` and then reads everything byte by byte. This is known to be very slow. The right solution is to require a single `readBytes!(::IO, Vector{UInt8})::UInt` and read data in chunks. Go and several other languages do this. In JuliaLang PR #16656, you suggested something similar; unfortunately, they didn't follow up. The Julia `IO` stack needs a revamp to be performant.
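As a sketch of that chunked approach applied to this thread's example (again assuming `"libz"` is loadable; the type name and error handling are mine), a single `readbytes!` method backed by zlib's `gzread()` pulls decompressed data a block at a time:

```julia
# Sketch: a gzip-backed type whose chunked primitive is Base.readbytes!,
# implemented with one gzread() ccall per block. Assumes "libz" loads.
mutable struct GzChunked <: IO
    fp::Ptr{Cvoid}
    function GzChunked(fn::AbstractString, mode::AbstractString = "r")
        x = ccall((:gzopen, "libz"), Ptr{Cvoid}, (Cstring, Cstring), fn, mode)
        x == C_NULL && error("gzopen failed: $fn")
        new(x)
    end
end

Base.eof(fp::GzChunked) = ccall((:gzeof, "libz"), Cint, (Ptr{Cvoid},), fp.fp) != 0
Base.close(fp::GzChunked) = (ccall((:gzclose, "libz"), Cint, (Ptr{Cvoid},), fp.fp); nothing)

# The chunked primitive: fill `b` with up to `nb` decompressed bytes in one call.
function Base.readbytes!(fp::GzChunked, b::Vector{UInt8}, nb = length(b))
    nb > length(b) && resize!(b, nb)
    n = ccall((:gzread, "libz"), Cint, (Ptr{Cvoid}, Ptr{UInt8}, Cuint), fp.fp, b, nb % Cuint)
    n < 0 && error("gzread failed")
    return Int(n)
end
```

With this single method, a buffered line splitter can slice many lines out of each block, so the per-byte `ccall` overhead disappears entirely.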