Read Vector{UInt8} lines from a gzipped file (optimization)

question

#1

I am reading (and parsing) a huge dataset, essentially ASCII lines in gzipped files. I have optimized the parsing part; now I am trying to figure out how to read the lines themselves efficiently.

I am using the excellent CodecZlib (which I believe is the fastest solution). Extracting the data from the string returned by readline is pretty fast, and readuntil is actually faster still. I thought reading into a preallocated buffer would be faster (line length can be bounded), but it is actually slower.

If anyone wants to help make this faster, I made a self-contained MWE, available below. It should give sub-second read times on a recent computer. My actual runtimes are around 30 minutes for the whole dataset, and I do this repeatedly, so every little bit helps.


#2

Basically, all you need is a way to pass a preallocated buffer to readuntil. Since TranscodingStreams doesn’t seem to provide this, we can hack our own (readuntil2!) just to try out the performance:

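# Evaluated inside the TranscodingStreams module because it relies on
# package internals (changestate!, findbyte, buffersize, readdata!).
@eval TranscodingStreams begin
    readuntil2(stream::TranscodingStream, delim::UInt8) = readuntil2!(UInt8[], stream, delim)

    # Like readuntil, but writes into the caller-supplied vector `ret`.
    function readuntil2!(ret::Vector{UInt8}, stream::TranscodingStream, delim::UInt8)
        changestate!(stream, :read)
        buffer1 = stream.state.buffer1
        resize!(ret, 0)
        filled = 0  # number of bytes copied into `ret` so far
        while !eof(stream)
            pos = findbyte(buffer1, delim)
            if pos == 0
                # Delimiter not in the buffered data: take everything buffered.
                sz = buffersize(buffer1)
                if length(ret) < filled + sz
                    resize!(ret, filled + sz)
                end
            else
                # Delimiter found: take the bytes up to and including it.
                sz = pos - buffer1.bufferpos + 1
                resize!(ret, filled + sz)
            end
            readdata!(buffer1, ret, filled+1, sz)
            filled += sz
            if pos > 0
                break
            end
        end
        return ret
    end
end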

We then get:

julia> struct Buffer
           buffer::Vector{UInt8}
           Buffer() = new(UInt8[])
       end

julia> (buff::Buffer)(io::IO) = TranscodingStreams.readuntil2!(buff.buffer, io, UInt8('\n'))

julia> io = GzipDecompressionStream(open(filename));

julia> buff = Buffer()
Buffer(UInt8[])

julia> @time dolines(buff, io, 1_000_000)
  0.298013 seconds (6.58 k allocations: 154.266 KiB)
53847323

Compare this to, e.g., readline1:

julia> io = GzipDecompressionStream(open(filename));

julia> @time dolines(readline1, io, 1_000_000)
  0.555570 seconds (5.01 M allocations: 381.854 MiB, 18.27% gc time)
53847323
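
The difference in allocations suggests that most of the gain comes from reusing one vector instead of allocating a new one per line. As a rough sketch of that pattern using only Base functions, something like the following could work. Note that readuntil_into! and total_bytes are hypothetical names made up for this illustration (they are not part of Base, TranscodingStreams, or the MWE), an in-memory IOBuffer stands in for the gzip stream, and the 256-byte capacity hint is an assumed bound on line length. Reading one byte at a time is likely slower than the internals-based version above, so this only shows the reuse idea:

```julia
# Hypothetical helper: read bytes from `io` into `buf`, up to and including
# `delim` (or until end of stream), reusing the storage already held by `buf`.
function readuntil_into!(buf::Vector{UInt8}, io::IO, delim::UInt8)
    empty!(buf)                  # length becomes 0; capacity is typically kept
    while !eof(io)
        b = read(io, UInt8)
        push!(buf, b)            # should not reallocate once capacity suffices
        b == delim && break
    end
    return buf
end

# Hypothetical driver: read at most `n` lines and sum their lengths in bytes.
function total_bytes(io::IO, n::Integer)
    buf = sizehint!(UInt8[], 256)    # assumed upper bound on line length
    total = 0
    for _ in 1:n
        eof(io) && break
        total += length(readuntil_into!(buf, io, UInt8('\n')))
    end
    return total
end

io = IOBuffer("abc\nde\nf")      # in-memory stand-in for the decompression stream
println(total_bytes(io, 10))
```

The same `buf` is passed to every call, so after the first few lines no further growth of the vector should be needed.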

#3

@kristoffer.carlsson: Thanks, this is indeed a large improvement (actually 3x faster in some scenarios I am testing).

Just checking: is it OK if I submit a PR based on your code to TranscodingStreams, which is under the MIT “Expat” License?


#4

Sure