lh3/Biofast benchmark | FASTQ parsing [Julia,Nim,Crystal,Python,...]

ImreSamu · May 19, 2020, 11:16am

from the community news:
" Biofast is a small benchmark for evaluating the performance of programming languages and implementations on a few common tasks in the field of Bioinformatics. It currently includes two benchmarks: FASTQ parsing and interval query."

Please help to improve the Julia part.

blogpost: Fast high-level programming languages
HN comments: Fast high-level programming languages | Hacker News

jakobnissen · May 19, 2020, 6:35pm

I recreated the C, Julia and Python part of the first benchmark. A few observations:

The library Klib.jl is a bit obscure, seemingly just used by one guy, and not very well maintained. I don’t think it’s useful to pick that code apart, it’s just one random guy’s Julia code. It’s much more interesting to look at the FASTQ benchmark using FASTX.jl from BioJulia.

I don’t see the same relative numbers between C, Python and Julia that he does: for me, Julia does about 30% better on non-zipped data. It might be a matter of hardware, or the fact that I ran FASTX v1.1, not v1.0.
Looking at his Julia code, it looks great. There’s some type instability, but union splitting takes care of that, and it’s no performance concern.
For the FASTX code, I noticed the FASTQ parser does not use @inbounds. Since it reads the input byte by byte, this has a rather large effect: adding it takes another 30% off the running time on non-zipped input. With this optimization, the Julia code takes about 1.25-1.3 times as long as his C code. Not bad! We probably shouldn’t actually remove the bounds checks, because IMO the extra safety is worth a little extra time, especially considering that FASTQ files are essentially always gzipped.
For the gzipped data, it appears that CodecZlib is very slow, something like 4x slower than zlib. That’s very strange, considering it just calls zlib directly. Profiling confirms that almost all time is spent in the ccall line. I created this issue. Jaakko Ruohio on Slack discovered that about half the problem was that zlib was not compiled with -O3, but there are still more gains to be had; I just don’t know where to find them. After this, Julia is about 1.4 times slower than C on gzipped data, whereas we ought to be very near C speed here, something like a factor of 1.1. If anyone can fix the CodecZlib issue, that would be great.
The implementation of the FASTQ format and the parsing is very efficient. Except for the boundscheck issue and the zlib issue, the only place left I can find is to optimize Automa.jl directly - which would always be welcome, but I don’t know how.
Edit: Oh yes, and the elephant in the room: the Julia code goes through the 5 million FASTQ reads in 5 seconds, whereas it takes around 11 seconds when called from the command line, due to compile-time latency. That puts us way behind C speed, just a tad ahead of Python.

Edit2: It seems the zlib source code that CodecZlib obtains from zlib.net is not as optimized as the zlib library that ships with macOS. The remaining difference in performance is therefore due to an upstream inefficiency. I don’t know if we can find a faster implementation of zlib to use in CodecZlib. Presumably macOS can make certain assumptions about what OS and CPU its users have, which Julia can’t.
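For readers who have not looked at the benchmark, here is a minimal Python sketch of the kind of work a FASTQ counting task involves: read every record and tally records, sequence bytes and quality bytes, with transparent gzip handling. The function names `open_maybe_gzip` and `count_fastq` are made up for illustration, and the sketch assumes plain 4-line records (no multi-line FASTQ). It is not the code used in biofast or FASTX.jl.

```python
import gzip
import io


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed file for binary reading (hypothetical helper)."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rb")
    return open(path, "rb")


def count_fastq(stream):
    """Count records, sequence bytes and quality bytes in 4-line FASTQ input."""
    n_records = seq_len = qual_len = 0
    while True:
        header = stream.readline()
        if not header:
            break  # end of input
        seq = stream.readline().rstrip(b"\r\n")
        stream.readline()  # '+' separator line, ignored in this sketch
        qual = stream.readline().rstrip(b"\r\n")
        n_records += 1
        seq_len += len(seq)
        qual_len += len(qual)
    return n_records, seq_len, qual_len


demo = io.BytesIO(b"@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
print(count_fastq(demo))
```

On a real file one would call `count_fastq(open_maybe_gzip(path))`. For gzipped input, the time spent inside the decompressor is separate from the parsing loop, which is why the zlib build matters so much in the numbers above.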

Ward9250 · May 20, 2020, 12:58pm

The other elephant in the room is how much does any of this actually matter vs. other perhaps less benchmarkable aspects of productivity gained using julia and BioJulia?

jakobnissen · May 20, 2020, 1:11pm

Indeed - that becomes a much more involved discussion than pure performance and can’t easily be resolved with tables and measurements. Nonetheless, I do think it’s nice to have the foundation of Julia bioinfo packages be so fast that people would not be turned away from implementing highly performance-sensitive tools in it, like a short-read aligner or an assembler.

W.r.t. the broader implications of Julia, first notice the absolute timings. On my laptop, Julia crunches 3.6 million reads/sec uncompressed and something like 1.6 million reads/sec compressed. It’s hard to think of a real-life application where any useful work could be done so quickly that these timings matter much.

Second, I think it’s very instructive to compare the approach to FASTQ parsing represented by klib.h vs FASTX.jl. Have a look at the source. FASTX implements its parser through a high-level description of the FASTQ format thanks to Automa and lets the machine figure out most of the gritty details. In contrast, look at the C code, here the entire parsing is written by hand. I know which parser I’d rather write myself, and which parser I’d trust results from (assuming Automa itself is well tested). Also, I had some fun trying to break the parsers by seeing what kind of broken FASTQ files they would accept. Unsurprisingly, FASTX’s parser is more robust (because it’s failsafe by design!).
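To make the "broken FASTQ files" point concrete, here is a small hand-written Python sketch of the checks a strict parser might apply. This is explicitly not the Automa approach (FASTX.jl generates its parser from a description of the format); `read_fastq_strict` and `FastqFormatError` are hypothetical names, and the rules shown (4-line records, `@` header, `+` separator, equal sequence and quality lengths) are a simplified assumption about what "well-formed" means.

```python
import io


class FastqFormatError(ValueError):
    """Raised when a record breaks the simple 4-line FASTQ layout."""


def read_fastq_strict(stream):
    """Yield (name, sequence, quality); raise on the first malformed record."""
    line_no = 0
    while True:
        header = stream.readline()
        if not header:
            return  # clean end of input
        seq, plus, qual = stream.readline(), stream.readline(), stream.readline()
        start = line_no + 1
        line_no += 4
        if not qual:
            raise FastqFormatError(f"record at line {start}: truncated")
        header, seq, plus, qual = (s.rstrip("\r\n") for s in (header, seq, plus, qual))
        if not header.startswith("@") or len(header) < 2:
            raise FastqFormatError(f"record at line {start}: bad header")
        if not plus.startswith("+"):
            raise FastqFormatError(f"record at line {start}: missing '+' line")
        if len(seq) != len(qual):
            raise FastqFormatError(f"record at line {start}: sequence/quality length mismatch")
        yield header[1:], seq, qual


good = io.StringIO("@r1\nACGT\n+\nIIII\n")
print(list(read_fastq_strict(good)))

bad = io.StringIO("@r2\nACGT\n+\nII\n")  # quality shorter than sequence
try:
    list(read_fastq_strict(bad))
except FastqFormatError as err:
    print("rejected:", err)
```

A lenient parser would silently accept the second input. Whether to reject or tolerate such records is a design choice, and every rule in a hand-written version like this one has to be remembered and tested individually.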

Ward9250 · May 20, 2020, 2:41pm

At the risk of sounding catty, I also despair at people STILL complaining about Bio.jl loading. It has been unsupported for long enough, the header in the readme of the repo is quite clear, and the status badge says “inactive”.

Also, reading:

“Probably my Julia implementations here will get most slaps. I have seen quite a few you-are-holding-the-phone-wrong type of responses from Julia supporters.”

Maybe I’m crazy, but essentially saying “oh I knew my implementation would be criticised” … when we can see the implementation IS affecting the benchmark… is not a valid defence of the implementation.

jakobnissen · May 20, 2020, 6:48pm

Hm, maybe there’s actually something actionable there. Like putting a large red banner on the Bio.jl GitHub page, or printing a warning when you import Bio.jl.

Ward9250 · May 20, 2020, 11:40pm

Yeah I’m going to take its site down almost entirely and put up a great big stop sign.

kevbonham · May 21, 2020, 12:16am

I think we should release a new patch version where the only change is that it prints a warning saying that it’s deprecated when people try to load it.

bicycle1885 · May 21, 2020, 3:58am

Just posted a pull request to improve the performance of the fqcnt_jl1_klib.jl script.

github.com/lh3/biofast

Improve performance of fqcnt_jl1_klib.jl

lh3:master ← bicycle1885:perf-julia

opened 03:57AM - 21 May 20 UTC

bicycle1885

+15 -10

Hi, I've slightly improved the performance of fqcnt_jl1_klib.jl. Here are the step-by-step improvements from 759cd6a to 15d9864 on my machine:

```
HEAD is now at 759cd6a use bit operators
Benchmark #1: fqcnt/fqcnt_jl1_klib.jl biofast-data-v1/M_abscessus_HiSeq.fq
  Time (mean ± σ):     4.280 s ± 0.083 s    [User: 3.923 s, System: 0.838 s]
  Range (min … max):   4.134 s … 4.398 s    10 runs

Previous HEAD position was 759cd6a use bit operators
HEAD is now at 7fd4ff8 make readbyte type stable
Benchmark #1: fqcnt/fqcnt_jl1_klib.jl biofast-data-v1/M_abscessus_HiSeq.fq
  Time (mean ± σ):     3.717 s ± 0.074 s    [User: 3.390 s, System: 0.793 s]
  Range (min … max):   3.620 s … 3.834 s    10 runs

Previous HEAD position was 7fd4ff8 make readbyte type stable
HEAD is now at f1b09c2 do not count string length twice
Benchmark #1: fqcnt/fqcnt_jl1_klib.jl biofast-data-v1/M_abscessus_HiSeq.fq
  Time (mean ± σ):     2.834 s ± 0.046 s    [User: 2.513 s, System: 0.805 s]
  Range (min … max):   2.769 s … 2.936 s    10 runs

Previous HEAD position was f1b09c2 do not count string length twice
HEAD is now at 2984dc8 use unsafe_string to make strings
Benchmark #1: fqcnt/fqcnt_jl1_klib.jl biofast-data-v1/M_abscessus_HiSeq.fq
  Time (mean ± σ):     2.380 s ± 0.026 s    [User: 2.101 s, System: 0.755 s]
  Range (min … max):   2.331 s … 2.411 s    10 runs

Previous HEAD position was 2984dc8 use unsafe_string to make strings
HEAD is now at 15d9864 use memchr to find delimiter
Benchmark #1: fqcnt/fqcnt_jl1_klib.jl biofast-data-v1/M_abscessus_HiSeq.fq
  Time (mean ± σ):     1.893 s ± 0.040 s    [User: 1.615 s, System: 0.756 s]
  Range (min … max):   1.842 s … 1.995 s    10 runs
```

```
julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2400G with Radeon Vega Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
Environment:
  JULIA_PROJECT = @.
```
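To put the commit-by-commit timings from the pull request in perspective, the short Python sketch below turns the reported mean wall times into per-step and cumulative speedups. The commits, descriptions and timings are copied from the output above; the helper name `speedups` is made up for illustration.

```python
# (commit, description, mean wall time in seconds) as reported in the pull request
steps = [
    ("759cd6a", "use bit operators", 4.280),
    ("7fd4ff8", "make readbyte type stable", 3.717),
    ("f1b09c2", "do not count string length twice", 2.834),
    ("2984dc8", "use unsafe_string to make strings", 2.380),
    ("15d9864", "use memchr to find delimiter", 1.893),
]


def speedups(rows):
    """Return (commit, description, speedup vs previous row, speedup vs first row)."""
    base = prev = rows[0][2]
    out = []
    for commit, desc, seconds in rows:
        out.append((commit, desc, prev / seconds, base / seconds))
        prev = seconds
    return out


for commit, desc, step, total in speedups(steps):
    print(f"{commit}  {step:5.2f}x step  {total:5.2f}x total  {desc}")
```

Taken together, the five commits bring the mean from 4.280 s to 1.893 s, roughly a 2.26x improvement, with no single change accounting for most of it.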

xiaodai · May 21, 2020, 7:44am

This is the most interesting line I see in the post:

“Also importantly, the Julia developers do not value backward compatibility. There may be a python2-to-3 like transition in several years if they still hold their views by then. I wouldn’t take the risk.”

What’s your take?

kristoffer.carlsson · May 21, 2020, 7:52am

Yeah, it’s not like we run the tests for all registered packages on every release, fix the problems in Julia they reveal, make patches to make the internals more backward compatible if we see people use them, open PRs and issues on packages that used internals that have since changed, etc.

And it’s not like I have a spreadsheet with e.g. all the test regressions for packages in 1.5 that we are looking into (even though 90% of the time the test errors are due to bad tests in packages or relying on Julia internals). Oh, wait, I do, it’s here: PkgEval 1.5 - Google Sheets.

xiaodai · May 21, 2020, 8:04am

That’s actually pretty cool! It’s great insight!

I don’t know how someone from outside the Julia community would form such a view, or how prevalent that view is.

Perhaps he got burned real bad during Python 2 → 3, and he’s “worried” about Julia 1 → 2 transition?

jgreener64 · May 21, 2020, 9:24am

I think it comes, fairly or not, from the transition to Julia 1.0, which a lot of people found disruptive.

It doesn’t follow that future releases will be so, and things seem to have stabilised massively since then.

xiaodai · May 21, 2020, 9:27am

OK, I forgot about that. It’s about expectations then. I never expected the move from 0.6 to 1.0 to have no breaking changes. Massive efforts were put into 0.7 to make the transition easier. So yeah.

Actually, I expected things to break from 0.6 to 1.

Surely, things broke real bad in Python 1 → 2 too, but maybe Python wasn’t as big when that transition happened, and hence it wasn’t the huge drama that 2 → 3 was.

giordano · May 21, 2020, 9:35am

Yes, there was 0.7, but it was the latest released version for only about 24 hours or so, so many people skipped it in practice.

kevbonham · May 21, 2020, 4:46pm

Didn’t dig through the spreadsheet (I’m on mobile), but I’m guessing you’re not pulling info from registries other than General, right?

To be clear, I think that would be a totally reasonable choice, but it’s another thing we BioJulia folks need to consider with our separate registry. I don’t know if you’ve ever publicized that spreadsheet before, but it seems likely that there are other “unofficial” community support things that we don’t have access to. @Ward9250 @bicycle1885 something to think about.

kevbonham · May 21, 2020, 4:47pm

Actually, now that I think about it, we should probably do this for any package that was registered on General but is now in the BioJulia registry.

kristoffer.carlsson · May 21, 2020, 5:41pm

Nope.

Palli · May 22, 2020, 12:47pm

It’s interesting to see how fast the Julia code is (after tuning), and also how condensed it is (though it could be a bit more so); the same goes for the Crystal code.

I see “430 lines (382 sloc)” for a file, but it’s really 308 sloc, smaller than Crystal (at least for this one file, and corresponding one). biofast/Klib.jl at master · lh3/biofast · GitHub

It’s good to keep in mind that “sloc” is misleading here, since it counts triple-quoted comments as code (and yes, at least one comment contained a code example).

C, and Crystal (nearly identical speed) are the languages to beat (on that metric too), and we’re close.

lh3 · May 22, 2020, 8:24pm

Author here. This is a benchmark. We would like to compare different implementations of the same algorithm. Klib.jl is the most relevant here. I want to thank @bicycle1885 again for improving this Julia implementation. I really appreciate it.

My main concern with Julia is its silent performance traps: correct results but poor performance. The poor performance here is caused by small details like length vs sizeof and type conversion. These are not apparent to new users. I guess even experienced devs may need to pay extra attention. Currently, Julia is still slow on bedcov. I probably messed up typing somewhere, but I am not sure where, given that the code is just doing array accesses.
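As a rough illustration of the length-versus-sizeof distinction mentioned above: in a UTF-8 string, the number of characters and the number of bytes are different quantities, and counting characters means scanning the bytes, while the byte count is already known. My understanding is that Julia's `length` on a `String` does such a scan and `sizeof` does not. The Python sketch below only demonstrates the two quantities; Python's own `len` on a `str` is constant-time, so it does not reproduce the performance trap itself. `count_chars_utf8` is a made-up helper.

```python
def count_chars_utf8(raw: bytes) -> int:
    """Count characters by scanning UTF-8 bytes and skipping continuation bytes."""
    return sum(1 for b in raw if (b & 0xC0) != 0x80)


text = "naïve ACGT"
raw = text.encode("utf-8")

print(len(raw))               # number of bytes, known without scanning
print(count_chars_utf8(raw))  # number of characters, requires a full pass
print(len(text))              # Python's own character count, for comparison
```

For pure ASCII sequence data the two numbers coincide, which is why using the scanning variant in a hot loop gives correct results and only shows up as lost time.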

I am more impressed by Crystal to be honest. My first working Crystal implementation is close to the reported speed. The Nim FASTQ parser is also fast on my first try. A good language should set few traps.

As to other points in this thread:

This is not a zlib benchmark. The reported number now comes from the system zlib, thanks to @bicycle1885’s suggestion. I can understand why Julia ships its own zlib, but this did catch me off guard, again.
The klib fastx parser is more flexible. It parses multi-line fasta and fastq at the same time. You don’t need to tell klib the input format. At least this is an important feature to me. With a regex parser, Fastx.jl can’t do it. In addition, klib has no Julia dependencies and is more than 10 times faster to compile. Due to the long startup of Julia, lightweight libraries are more appreciated.
Productivity: on these two tasks, Crystal and Nim are also expressive. I spent less time on Crystal and Nim to achieve decent performance. I still have performance issues with Julia on bedcov.
When you read through a 100Gb gzip’d fastq file, performance matters. 30min vs 1hr is a huge difference. High-performance tools often put fastq reading in a separate thread because it is too slow. Zlib is the main bottleneck here, but parsing time should be minimized as well.
Fastx.jl. I installed Fastx.jl following the instructions on its website. I got 1.0.0. It didn’t work with CodecZlib. I manually fixed that.
Bio.jl. I didn’t know about Fastx.jl. I googled biojulia and went to biojulia.net. The website has no documentation. It doesn’t tell me Fastx.jl is the right way to parse fastq. I thought Biojulia was like all the other Bio* projects, with a single module. I ended up with a Bio.jl script first. Also, Bio.jl is often the top Google search result. For example, you can try “biojulia interval overlap”. By the way, I wanted to implement bedcov with Biojulia. I couldn’t find the right API and gave up.
