lhe/Biofast benchmark | FASTQ parsing [Julia,Nim,Crystal,Python,...]

I recreated the C, Julia and Python parts of the first benchmark. A few observations:

The library Klib.jl is a bit obscure, seemingly just used by one guy, and not very well maintained. I don’t think it’s useful to pick that code apart, since it’s just one random guy’s Julia code. It’s much more interesting to look at the FASTQ benchmark using FASTX.jl from BioJulia.

  • I don’t see the same relative numbers between C, Python and Julia that he does. For me, Julia does about 30% better on non-zipped data. It might be a matter of hardware, or the fact that I ran FASTX v1.1, not v1.0.

  • His Julia code looks great. There’s some type instability, but union splitting takes care of that, so it’s not a performance concern.

  • For the FASTX code, I noticed the FASTQ parser does not use @inbounds. Since it reads the input byte by byte, this has a rather large effect: adding it takes another 30% off the running time for the non-zipped input. With this optimization, the Julia running time is about 1.25-1.3 times that of his C code. Not bad! We probably shouldn’t actually remove the bounds checks, because IMO the extra safety is worth a little extra time, especially considering that FASTQ files are essentially always gzipped.

  • For the gzipped data, it appears that CodecZlib is very slow, something like 4x slower than zlib. That’s very strange, considering it just calls zlib directly. Profiling confirms that almost all time is spent in the ccall line. I created this issue. Jaakko Ruohio on Slack discovered that about half the problem was that zlib was not compiled with -O3, but there are still more gains to be had; I just don’t know where to find them. After this fix, Julia is about 1.4 times slower than C on gzipped data, whereas we ought to be very near C speed here, something like a factor of 1.1. If anyone can fix the CodecZlib issue, that would be great.

  • The implementation of the FASTQ format and the parsing is very efficient. Apart from the bounds check issue and the zlib issue, the only remaining place I can find to optimize is Automa.jl itself. That would always be welcome, but I don’t know how.

  • Edit: Oh yes, and the elephant in the room: the Julia code goes through the 5 million FASTQ reads in 5 seconds, whereas it takes around 11 seconds when called from the command line due to compile time latency. That puts us way behind C speed, just a tad ahead of Python.
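
To make the bounds check and latency points above concrete, here is a minimal sketch. The two functions are hypothetical stand-ins for a byte-by-byte parsing loop, not the actual FASTX.jl or Automa.jl code, and the input is synthetic four-line records. Timing each function twice separates the first call (which includes compilation) from a warm call.

```julia
# Hypothetical stand-ins for a byte-by-byte parser loop; not FASTX.jl's code.
# Both count newlines and assume every FASTQ record spans exactly four lines.
function count_records_checked(data::Vector{UInt8})
    n = 0
    p = 1
    while p <= length(data)
        byte = data[p]               # load with the default bounds check
        n += (byte == UInt8('\n'))
        p += 1
    end
    return n ÷ 4
end

function count_records_inbounds(data::Vector{UInt8})
    n = 0
    p = 1
    while p <= length(data)
        byte = @inbounds data[p]     # same load, bounds check elided
        n += (byte == UInt8('\n'))
        p += 1
    end
    return n ÷ 4
end

# Synthetic input: 100_000 identical four-line records.
record = Vector{UInt8}("@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n")
data = repeat(record, 100_000)

for f in (count_records_checked, count_records_inbounds)
    first_call = @elapsed f(data)    # includes JIT compilation
    warm_call = @elapsed f(data)     # compiled code only
    println(f, ": first call ", first_call, " s, warm call ", warm_call, " s")
end
```

In a loop this simple the compiler may be able to prove the access is safe and drop the check on its own, so the two warm timings can come out nearly identical. A generated state machine like the real parser is harder to analyze, which is presumably why @inbounds matters more there.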

Edit2: It seems the zlib.h source code that CodecZlib obtains from zlib.net is not as optimized as the zlib library that ships with macOS. The remaining difference in performance is therefore due to an upstream inefficiency. I don’t know if we can find a faster implementation of zlib to use in CodecZlib. Presumably macOS can make certain assumptions about what OS and CPU their users have which Julia can’t.
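
When comparing against the system library, it can help to confirm which zlib build Julia actually loads. The sketch below asks the library for its version string through zlib's C function zlibVersion. It assumes a Julia version where Zlib_jll is available as a standard library (1.6 or later) and that CodecZlib resolves to that same build, which I have not verified for every setup.

```julia
using Zlib_jll  # assumption: the zlib build that ships with Julia 1.6+

# zlibVersion() is part of zlib's C API and returns a NUL-terminated string.
zlib_version() = unsafe_string(ccall((:zlibVersion, libz), Cstring, ()))

println("zlib loaded by Julia: ", zlib_version())
```

The version string alone says nothing about compiler flags such as -O3, but it does show whether two setups are even running the same zlib release.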
