lh3/Biofast benchmark | FASTQ parsing [Julia, Nim, Crystal, Python, ...]

From the community news:
"Biofast is a small benchmark for evaluating the performance of programming languages and implementations on a few common tasks in the field of Bioinformatics. It currently includes two benchmarks: FASTQ parsing and interval query."

Please help to improve the Julia part. :slight_smile:

I recreated the C, Julia and Python part of the first benchmark. A few observations:

The library Klib.jl is a bit obscure, seemingly just used by one guy, and not very well maintained. I don’t think it’s useful to pick that code apart; it’s just one random guy’s Julia code. It’s much more interesting to look at the FASTQ benchmark using FASTX.jl from BioJulia.

  • I don’t see the same relative numbers between C, Python and Julia that he does. For me, Julia does about 30% better on non-zipped data. It might be a matter of hardware, or the fact that I ran FASTX v1.1, not 1.0.

  • Looking at his Julia code, it looks great. There’s some type instability, but union splitting takes care of that, and it’s no performance concern.

  • For the FASTX code, I noticed the FASTQ parser does not use @inbounds. Since it reads the input byte by byte, this has a rather large effect: adding it takes another 30% off the time for the non-zipped input. With this optimization, Julia’s running time is about 1.25-1.3 times that of his C code. Not bad! We probably shouldn’t actually remove the bounds checks, because IMO the extra safety is worth a little extra time, especially considering that FASTQ files are essentially always gzipped.

  • For the gzipped data, it appears that CodecZlib is very slow, something like 4x slower than zlib. That’s very strange, considering it just calls zlib directly. Profiling confirms that almost all time is spent in the ccall line. I created this issue. Jaakko Ruohio on Slack discovered that about half the problem was that zlib was not compiled with -O3, but there are still more gains to be had; I just don’t know where to find them. After this, Julia is about 1.4 times slower than C on gzipped data, whereas we ought to be very near C speed here, something like a factor of 1.1. If anyone can fix the CodecZlib issue, that would be great.

  • The implementation of the FASTQ format and the parsing is very efficient. Apart from the bounds check issue and the zlib issue, the only remaining thing I can find to optimize is Automa.jl itself, which would always be welcome, but I don’t know how.

  • Edit: Oh yes, and the elephant in the room: The Julia code goes through the 5 million FASTQ reads in 5 seconds, whereas it takes around 11 seconds when called from the command line due to compile-time latency. That puts us way behind C speed, just a tad ahead of Python.

Edit2: It seems the zlib.h source code that CodecZlib obtains from zlib.net is not as optimized as the zlib library that ships with macOS. The remaining difference in performance is therefore due to an upstream inefficiency. I don’t know if we can find a faster implementation of zlib to use in CodecZlib. Presumably macOS can make certain assumptions about its users’ OS and CPU which Julia can’t.
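
To separate decompression cost from parsing cost when reproducing numbers like these, a rough timing harness can help. The following is a minimal Python sketch, not the benchmark's code: it uses Python's gzip module (which wraps zlib) rather than CodecZlib, and the synthetic reads, file names and helper names (make_fastq, count_records, compare) are made up for illustration.

```python
import gzip
import os
import tempfile
import time


def make_fastq(n_reads, read_len=100):
    """Build a synthetic FASTQ payload (4 lines per record) as bytes."""
    seq = ("ACGT" * (read_len // 4 + 1))[:read_len]
    qual = "I" * read_len
    records = (f"@read{i}\n{seq}\n+\n{qual}\n" for i in range(n_reads))
    return "".join(records).encode("ascii")


def count_records(path):
    """Count records by counting lines; gzip is chosen by file suffix."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def compare(n_reads=20000):
    """Write the same reads plain and gzipped, then time reading each."""
    data = make_fastq(n_reads)
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "reads.fq")
        zipped = os.path.join(tmp, "reads.fq.gz")
        with open(plain, "wb") as out:
            out.write(data)
        with gzip.open(zipped, "wb") as out:
            out.write(data)
        n_plain, t_plain = timed(count_records, plain)
        n_gz, t_gz = timed(count_records, zipped)
    return n_plain, n_gz, t_plain, t_gz


if __name__ == "__main__":
    n_plain, n_gz, t_plain, t_gz = compare()
    print(f"plain: {n_plain} reads in {t_plain:.3f}s")
    print(f"gzip:  {n_gz} reads in {t_gz:.3f}s")
```

The absolute numbers depend on the machine and on the zlib build in use, so only the ratio between the two lines is worth comparing.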

14 Likes

The other elephant in the room is how much does any of this actually matter vs. other perhaps less benchmarkable aspects of productivity gained using julia and BioJulia?

4 Likes

Indeed. That becomes a much more involved discussion than pure performance and can’t easily be resolved with tables and measurements. Nonetheless, I do think it’s nice to have the foundation of Julia bioinfo packages be so fast that people would not be turned away from implementing highly performance-sensitive tools in it, like a short-read aligner or an assembler.

W.r.t. the broader implications of Julia, first notice the absolute timings. On my laptop, Julia crunches 3.6 million reads/sec uncompressed and something like 1.6 million reads/sec compressed. It’s hard to think of a real-life application where any useful work could be done so quickly that these timings matter much.

Second, I think it’s very instructive to compare the approach to FASTQ parsing represented by klib.h vs FASTX.jl. Have a look at the source. FASTX implements its parser through a high-level description of the FASTQ format thanks to Automa and lets the machine figure out most of the gritty details. In contrast, look at the C code: here the entire parsing is written by hand. I know which parser I’d rather write myself, and which parser I’d trust results from (assuming Automa itself is well tested). Also, I had some fun trying to break the parsers by seeing what kind of broken FASTQ files they would accept. Unsurprisingly, FASTX’s parser is more robust (because it’s failsafe by design!).
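
To make the robustness point concrete, here is a sketch of the kind of checks a strict parser applies to each four-line record. It is a hand-written Python illustration, not how FASTX.jl or Automa work (those generate the parser from a format description), and the function name parse_fastq_strict is hypothetical. It only handles the simple four-lines-per-record layout.

```python
def parse_fastq_strict(text):
    """Yield (name, sequence, quality) tuples; raise ValueError on malformed input."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("truncated record: line count is not a multiple of 4")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or len(header) < 2:
            raise ValueError(f"line {i + 1}: header must be '@' followed by a name")
        if not plus.startswith("+"):
            raise ValueError(f"line {i + 3}: separator line must start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"line {i + 4}: sequence and quality lengths differ")
        # Phred+33 quality characters lie in the printable range '!'..'~'.
        if any(not ("!" <= ch <= "~") for ch in qual):
            raise ValueError(f"line {i + 4}: quality character out of range")
        yield header[1:], seq, qual


if __name__ == "__main__":
    good = "@r1\nACGT\n+\nIIII\n"
    bad = "@r1\nACGT\n+\nIII\n"  # quality string is one character short
    print(list(parse_fastq_strict(good)))
    try:
        list(parse_fastq_strict(bad))
    except ValueError as err:
        print("rejected:", err)
```

A lenient parser would silently accept the second input; a strict one stops at the first inconsistent record.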

10 Likes

At the risk of sounding catty, I also despair at people STILL complaining about Bio.jl loading. It has been unsupported for long enough, the header in the readme of the repo is quite clear, and the status badge says “inactive”.

Also, reading:

“Probably my Julia implementations here will get most slaps. I have seen quite a few you-are-holding-the-phone-wrong type of responses from Julia supporters.”

Maybe I’m crazy, but essentially saying “oh I knew my implementation would be criticised” … when we can see the implementation IS affecting the benchmark… is not a valid defence of the implementation.

3 Likes

Hm, maybe there’s actually something actionable there. Like making a large red banner on the Bio.jl GitHub, or printing a warning when you import Bio.jl

2 Likes

Yeah I’m going to take its site down almost entirely and put up a great big stop sign.

1 Like

I think we should release a new patch version where the only change is that it prints a warning saying that it’s deprecated when people try to load it.

4 Likes

Just posted a pull request to improve the performance of the fqcnt_jl1_klib.jl script.

4 Likes

This line is the most interesting I see in the post

“Also importantly, the Julia developers do not value backward compatibility. There may be a python2-to-3 like transition in several years if they still hold their views by then. I wouldn’t take the risk.”

What’s your take?

Yeah, it’s not like we run the tests for all registered packages on every release, fix the problems in Julia they reveal, make patches to make the internals more backward compatible if we see people use them, open PRs and issues on packages that used internals that have now changed, etc. etc.

And it’s not like I have a spreadsheet with e.g. all the test regressions for packages in 1.5 that we are looking into (even though 90% of the time the test errors are due to bad tests in packages or relying on Julia internals). Oh, wait, I do, it’s here: https://docs.google.com/spreadsheets/d/1Jiw1CnXGhA-_yB2pLSxDQyuz176Za79hbey2SvnKfGA/edit?usp=sharing.

20 Likes

That’s actually pretty cool! It’s great insight!

I don’t know why someone from outside the Julia community would form such a view, or how prevalent that view is.

Perhaps he got burned real bad during Python 2 -> 3, and he’s “worried” about Julia 1 -> 2 transition?

I think it comes, fairly or not, from the transition to Julia 1.0, which a lot of people found disruptive.

It doesn’t follow that future releases will be so, and things seem to have stabilised massively since then.

3 Likes

Ok, I forgot about that. It’s about expectations then. I never expected the move from 0.6 to 1.0 to come without breaking changes. Massive efforts were put into 0.7 to make the transition easier. So yeah.

Actually, I expected things to break from 0.6 to 1.

Surely, things broke real bad from Python 1 -> 2, but maybe Python wasn’t as big when that transition happened and hence it wasn’t the huge drama that 2 -> 3 was.

1 Like

Yes, there was 0.7, but it was the latest released version for only about 24 hours or so, so many people skipped it in practice :smile:

3 Likes

Didn’t dig through the spreadsheet (I’m on mobile), but I’m guessing you’re not pulling info from registries other than General, right?

To be clear, I think that would be a totally reasonable choice, but it’s another thing we BioJulia folks need to consider with our separate registry. I don’t know if you’ve ever publicized that spreadsheet before, but it seems likely that there are other “unofficial” community support things that we don’t have access to. @Ward9250 @bicycle1885 something to think about

Actually, now that I think about it, we should probably do this for any package that was registered on General but is now in the BioJulia registry.

Nope.

2 Likes

It’s interesting to see how fast the Julia code is (after tuning), and also how condensed it is (it could be a bit more so). The same goes for the Crystal code.

I see “430 lines (382 sloc)” for a file, but it’s really 308 sloc, smaller than Crystal (at least for this one file, and corresponding one). https://github.com/lh3/biofast/blob/master/lib/Klib.jl

It’s good to keep in mind that the “sloc” figure is misleading: it counts triple-quoted comments too (and yes, at least one comment contained a code example).

C, and Crystal (nearly identical speed) are the languages to beat (on that metric too), and we’re close.

Author here. This is a benchmark. We would like to compare different implementations of the same algorithm. Klib.jl is the most relevant here. I want to thank @bicycle1885 again for improving this Julia implementation. I really appreciate it.

My main concern with Julia is its silent performance traps: correct results but poor performance. The poor performance here is caused by small details like length vs sizeof and type conversion. These are not apparent to new users. I guess even experienced devs may need to pay extra attention. Currently, Julia is still slow on bedcov. I probably messed up typing somewhere, but I am not sure where, given that the code is just doing array accesses.
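
For context on the length vs sizeof detail: on a Julia String, length counts characters, which means scanning the UTF-8 bytes, while sizeof returns the number of bytes directly. The Python lines below are only an analogy for that character-versus-byte distinction, with made-up variable names. Python's len is constant-time on both types, so the performance trap itself does not carry over.

```python
text = "ACGT\u00e9"         # four ASCII letters plus one non-ASCII character
raw = text.encode("utf-8")  # the bytes a parser actually reads from a file

n_chars = len(text)  # number of characters (code points)
n_bytes = len(raw)   # number of bytes; the non-ASCII character takes two

print(n_chars, n_bytes)
```

For pure-ASCII sequence data the two counts agree, which is why picking the wrong one still gives correct results.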

I am more impressed by Crystal to be honest. My first working Crystal implementation is close to the reported speed. The Nim FASTQ parser is also fast on my first try. A good language should set few traps.

As to other points in this thread:

  • This is not a zlib benchmark. The reported number now comes from the system zlib, thanks to @bicycle1885’s suggestion. I can understand why Julia ships its own zlib, but this did catch me off guard, again.

  • The klib fastx parser is more flexible. It parses multi-line FASTA and FASTQ at the same time. You don’t need to tell klib the input format. This is an important feature, at least to me. With a regex parser, FASTX.jl can’t do it. In addition, klib has no Julia dependencies and is more than 10 times faster to compile. Due to the long startup of Julia, lightweight libraries are more appreciated.

  • Productivity: on these two tasks, Crystal and Nim are also expressive. I spent less time on Crystal and Nim to achieve decent performance. I still have performance issues with Julia on bedcov.

  • When you read through a 100Gb gzip’d fastq file, performance matters. 30min vs 1hr is a huge difference. High-performance tools often put fastq reading in a separate thread because it is too slow. Zlib is the main bottleneck here, but parsing time should be minimized as well.

  • FASTX.jl. I installed FASTX.jl following the instructions on its website. I got 1.0.0. It didn’t work with CodecZlib. I manually fixed that.

  • Bio.jl. I didn’t know about FASTX.jl. I googled biojulia and went to biojulia.net. The website has no documentation. It doesn’t tell me FASTX.jl is the right way to parse FASTQ. I thought BioJulia was like all the other Bio* projects with a single module. I ended up with a Bio.jl script first. Also, Bio.jl is often the top Google search result. For example, you can try “biojulia interval overlap”. By the way, I wanted to implement bedcov with BioJulia. I couldn’t find the right API and gave up.
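
The format auto-detection mentioned in the list above (multi-line FASTA and FASTQ from the same reader, with no format flag) can be sketched in a few lines. This is a Python illustration of the general hand-written approach, not klib's actual code. The name read_fastx and the sample data are hypothetical.

```python
import io


def read_fastx(handle):
    """Yield (name, sequence, quality) from a text handle with FASTA or FASTQ.

    quality is None for FASTA records. Sequence and quality may span lines.
    """
    last = None  # header line (or '+' line) read ahead of the current record
    while True:
        if last is None:
            for line in handle:  # skip ahead to the next header line
                if line[:1] in (">", "@"):
                    last = line.rstrip()
                    break
            if last is None:
                return  # end of input
        name, last = last[1:].partition(" ")[0], None  # name ends at first space
        seq_parts = []
        for line in handle:  # collect sequence lines until the next marker
            if line[:1] in ("@", "+", ">"):
                last = line.rstrip()
                break
            seq_parts.append(line.rstrip())
        seq = "".join(seq_parts)
        if last is None or last[0] != "+":
            yield name, seq, None  # no '+' line followed: treat as FASTA
            if last is None:
                return
        else:
            # FASTQ: read quality lines until they cover the sequence length,
            # so a quality line starting with '@' is not taken for a header.
            qual_parts, qual_len, last = [], 0, None
            for line in handle:
                chunk = line.rstrip()
                qual_parts.append(chunk)
                qual_len += len(chunk)
                if qual_len >= len(seq):
                    break
            yield name, seq, "".join(qual_parts)


if __name__ == "__main__":
    mixed = ">chr1 demo\nACGT\nACGT\n@r1\nAC\nGT\n+\nII\nII\n"
    for record in read_fastx(io.StringIO(mixed)):
        print(record)
```

The trade-off is the one discussed in this thread: a reader like this accepts many inputs without complaint, whereas a parser generated from a strict format description rejects anything that does not match.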

12 Likes