lhe/Biofast benchmark | FASTQ parsing [Julia,Nim,Crystal,Python,...]

Author here. This is a benchmark: the goal is to compare different implementations of the same algorithm. Klib.jl is the most relevant here. I want to thank @bicycle1885 again for improving this Julia implementation. I really appreciate it.

My main concern with Julia is its silent performance traps: correct results but poor performance. The poor performance here is caused by small details like length vs sizeof and type conversion. These are not apparent to new users. I guess even experienced devs may need to pay extra attention. Currently, Julia is still slow on bedcov. I probably messed up typing somewhere, but I am not sure where, given that the code is just doing array accesses.
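
To make the length vs sizeof point concrete, here is a small illustration. It is written in Python, not Julia, and it is not the benchmark code. It only shows the difference between counting characters and counting bytes. In Julia, `length(s)` on a `String` has to walk the UTF-8 data to count characters, while `sizeof(s)` just returns the stored byte count. The helper names `char_count` and `byte_count` are made up for this sketch.

```python
# Illustration only: Python stand-ins for the character-count vs byte-count
# distinction. Hypothetical helper names; not Julia and not the benchmark code.

def char_count(seq: str) -> int:
    """Number of Unicode characters (the quantity Julia's length() reports)."""
    return len(seq)

def byte_count(seq: str) -> int:
    """Number of UTF-8 bytes (the quantity Julia's sizeof() reports)."""
    return len(seq.encode("utf-8"))

ascii_read = "ACGTACGT"
mixed = "ACGT\u00b5"  # the micro sign takes two bytes in UTF-8

# For pure-ASCII sequence data both counts agree, so either one gives a
# correct result. That is why the slower choice goes unnoticed.
assert char_count(ascii_read) == byte_count(ascii_read) == 8

# The two only differ once non-ASCII characters show up.
assert char_count(mixed) == 5
assert byte_count(mixed) == 6
```

On sequence data the two calls return the same number, so nothing looks wrong until you profile.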

I am more impressed by Crystal to be honest. My first working Crystal implementation is close to the reported speed. The Nim FASTQ parser is also fast on my first try. A good language should set few traps.

As to other points in this thread:

  • This is not a zlib benchmark. The reported number now comes from the system zlib, thanks to @bicycle1885’s suggestion. I can understand why Julia ships its own zlib, but this did catch me off guard, again.

  • The klib fastx parser is more flexible. It parses multi-line fasta and fastq at the same time. You don’t need to tell klib the input format. At least this is an important feature to me. With a regex parser, Fastx.jl can’t do it. In addition, klib has no Julia dependencies and is more than 10 times faster to compile. Due to the long startup of Julia, lightweight libraries are more appreciated.

  • Productivity: on these two tasks, Crystal and Nim are also expressive. I spent less time on Crystal and Nim to reach decent performance. I still have performance issues with Julia on bedcov.

  • When you read through a 100GB gzip’d fastq file, performance matters. 30min vs 1hr is a huge difference. High-performance tools often put fastq reading in a separate thread because it is too slow. Zlib is the main bottleneck here, but parsing time should be minimized as well.

  • Fastx.jl. I installed Fastx.jl following the instructions on its website. I got version 1.0.0. It didn’t work with CodecZlib. I manually fixed that.

  • Bio.jl. I didn’t know about Fastx.jl. I googled biojulia and went to biojulia.net. The website has no documentation. It doesn’t tell me Fastx.jl is the right way to parse fastq. I thought Biojulia was like all the other Bio* projects, with a single module. I ended up with a Bio.jl script first. Also, Bio.jl is often the top Google search result. For example, you can try “biojulia interval overlap”. By the way, I wanted to implement bedcov with Biojulia. I couldn’t find the right API and gave up.
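
On the format-detection point in the list above (parsing multi-line fasta and fastq from the same stream without being told the format), here is a rough Python sketch of the idea. It is not the klib code. `read_fastx` and `Record` are hypothetical names, and it assumes the input is a text stream that is already decompressed. The sketch looks at the first character of each header ('>' or '@'), and for fastq it reads quality lines by length instead of by line count, so a quality string that starts with '@' is not mistaken for a header.

```python
import io
from typing import Iterator, Optional, TextIO, Tuple

# Hypothetical record shape for this sketch: (name, sequence, quality or None)
Record = Tuple[str, str, Optional[str]]

def read_fastx(fp: TextIO) -> Iterator[Record]:
    """Yield records from a stream holding FASTA, FASTQ, or a mix of both."""
    header = None  # a header line that was read but not yet consumed
    while True:
        if header is None:
            # Skip forward to the next header ('>' for FASTA, '@' for FASTQ).
            for line in fp:
                if line[:1] in (">", "@"):
                    header = line.rstrip("\r\n")
                    break
            if header is None:
                return  # end of input
        fields = header[1:].split(None, 1)
        name = fields[0] if fields else ""
        header = None

        # Collect sequence lines until the next header or a '+' separator.
        seq_parts = []
        stop = None
        for line in fp:
            if line[:1] in (">", "@", "+"):
                stop = line.rstrip("\r\n")
                break
            seq_parts.append(line.strip())
        seq = "".join(seq_parts)

        if stop is None or stop[0] != "+":
            # No '+' separator followed the sequence: this is a FASTA record.
            yield name, seq, None
            if stop is None:
                return
            header = stop  # the line that stopped us is the next header
            continue

        # FASTQ: read quality lines until they cover the sequence length.
        qual_parts = []
        qlen = 0
        for line in fp:
            q = line.rstrip("\r\n")
            qual_parts.append(q)
            qlen += len(q)
            if qlen >= len(seq):
                break
        qual = "".join(qual_parts)
        # If the input ends early, report the record without quality.
        yield name, seq, (qual if len(qual) >= len(seq) else None)

# Mixed input: multi-line FASTA, multi-line FASTQ, then FASTA again.
demo = ">fa1 desc\nACGT\nACGT\n@fq1\nAC\nGT\n+\n@III\n>fa2\nTTTT\n"
records = list(read_fastx(io.StringIO(demo)))
assert records == [
    ("fa1", "ACGTACGT", None),
    ("fq1", "ACGT", "@III"),
    ("fa2", "TTTT", None),
]
```

The caller never says which format it is passing in; the parser works that out record by record.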
