I am reading (and parsing) a huge dataset, essentially ASCII lines in gzipped files. I have optimized the parsing part; now I am trying to figure out how to read the lines efficiently.
I am using the excellent CodecZlib (which I believe is the fastest solution). Extracting the data from the string returned by readline is pretty fast, and so is readuntil (actually better). I thought I would make it faster by reading into a preallocated buffer (the line length can be bounded), but it is actually slower.
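For concreteness, the two approaches could look roughly like the sketch below. The helper names (`count_bytes_readline`, `readline_into!`, `count_bytes_buffer`) and the `maxlen` bound are hypothetical, chosen only for illustration, and the real code may differ. The functions accept any `IO`, so they can be tried on an `IOBuffer` without a gzipped file.

```julia
# Hypothetical helpers for illustration; they work on any `IO`.

# Approach 1: `readline` allocates a new String for every line.
function count_bytes_readline(io::IO)
    total = 0
    while !eof(io)
        line = readline(io)        # the trailing newline is stripped by default
        total += sizeof(line)
    end
    return total
end

# Approach 2: copy one line into a preallocated buffer, byte by byte.
# Returns the number of bytes written to `buf` (newline excluded).
# Assumes no line is longer than `length(buf)` and that lines end in '\n' only.
function readline_into!(buf::Vector{UInt8}, io::IO)
    n = 0
    while !eof(io)
        b = read(io, UInt8)
        b == UInt8('\n') && break
        n += 1
        buf[n] = b
    end
    return n
end

# Reuse a single buffer for the whole stream.
function count_bytes_buffer(io::IO; maxlen::Integer = 1024)
    buf = Vector{UInt8}(undef, maxlen)
    total = 0
    while !eof(io)
        total += readline_into!(buf, io)
    end
    return total
end
```

With CodecZlib, `io` would be something like `GzipDecompressorStream(open(path))`. In a loop of this shape, every byte costs a separate `eof` and `read` call on the stream, which may be one reason a buffer version ends up slower than `readline`.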
If anyone wants to help make this faster, I made a self-contained MWE, available below. It should offer sub-second runtimes for reading on a recent computer. My actual runtimes are around 30 minutes for the whole dataset, and I do this repeatedly, so every little bit helps.