I am reading (and parsing) a huge dataset, essentially ASCII lines in gzipped files. I have optimized the parsing part; now I am trying to figure out how to read the lines efficiently.
I am using the excellent CodecZlib (which I believe is the fastest solution). Extracting the data from the string returned by readline is pretty fast, and so is readuntil (actually better). I thought I would make it faster by reading into a buffer (line length can be bounded), but it is actually slower.
If anyone wants to help make this faster, I made a self-contained MWE, available below. It should run in under a second on a recent computer. My actual runtimes are around 30 minutes for the whole dataset, and I do this repeatedly, so every little bit helps.
https://gist.github.com/tpapp/66f053bcab3f763a72318a9e8ef6e177
Basically, the only thing you want is to be able to give a preallocated buffer to readuntil. Since TranscodingStreams doesn’t seem to provide this, we can hack our own (readuntil2) just to try out the performance:
@eval TranscodingStreams begin
    readuntil2(stream::TranscodingStream, delim::UInt8) = readuntil2!(UInt8[], stream, delim)

    function readuntil2!(ret::Vector{UInt8}, stream::TranscodingStream, delim::UInt8)
        changestate!(stream, :read)
        buffer1 = stream.state.buffer1
        resize!(ret, 0)
        filled = 0
        while !eof(stream)
            pos = findbyte(buffer1, delim)
            if pos == 0
                sz = buffersize(buffer1)
                if length(ret) < filled + sz
                    resize!(ret, filled + sz)
                end
            else
                sz = pos - buffer1.bufferpos + 1
                resize!(ret, filled + sz)
            end
            readdata!(buffer1, ret, filled+1, sz)
            filled += sz
            if pos > 0
                break
            end
        end
        return ret
    end
end
We then get:
julia> struct Buffer
           buffer::Vector{UInt8}
           Buffer() = new(UInt8[])
       end
julia> (buff::Buffer)(io::IO) = TranscodingStreams.readuntil2!(buff.buffer, io, UInt8('\n'))
julia> io = GzipDecompressionStream(open(filename));
julia> buff = Buffer()
Buffer(UInt8[])
julia> @time dolines(buff, io, 1_000_000)
0.298013 seconds (6.58 k allocations: 154.266 KiB)
53847323
Compared to e.g. readline1:
julia> io = GzipDecompressionStream(open(filename));
julia> @time dolines(readline1, io, 1_000_000)
0.555570 seconds (5.01 M allocations: 381.854 MiB, 18.27% gc time)
53847323
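The gap in allocations (about 6.58 k versus 5.01 M) is what you would expect if one version reuses a single byte vector while the other creates a new object per line. Here is a minimal Base-only sketch of that reuse pattern. The helper names readuntil_into! and count_lines_and_bytes are hypothetical and not part of Base or TranscodingStreams. The byte-at-a-time loop is only for illustration and will be much slower than copying from the stream's internal buffer as readuntil2! does.

```julia
# Hypothetical helper (assumption, not a library function): read bytes up to
# and including `delim` into a caller-supplied vector so it can be reused.
function readuntil_into!(buf::Vector{UInt8}, io::IO, delim::UInt8)
    empty!(buf)                  # reset the length; the storage is normally kept
    while !eof(io)
        b = read(io, UInt8)      # one byte at a time: simple, not fast
        push!(buf, b)
        b == delim && break      # stop after the delimiter, keeping it in buf
    end
    return buf
end

# Hypothetical driver: walk over all lines with one shared buffer and return
# the number of lines and the total number of bytes seen.
function count_lines_and_bytes(io::IO)
    buf = sizehint!(UInt8[], 64) # assumed upper bound on line length
    nlines = 0
    nbytes = 0
    while !eof(io)
        readuntil_into!(buf, io, UInt8('\n'))
        nlines += 1
        nbytes += length(buf)
    end
    return nlines, nbytes
end

nlines, nbytes = count_lines_and_bytes(IOBuffer("alpha\nbeta\ngamma"))
```

Any parsing step has to work on the bytes in the buffer before the next call overwrites them. Converting the buffer to a String for every line would bring the per-line allocation back.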
@kristoffer.carlsson: Thanks, this is indeed a large improvement (actually 3x faster in some scenarios I am testing).
Just checking: is it OK if I submit a PR based on your code to TranscodingStreams, which is under the MIT “Expat” License?