lhe/Biofast benchmark | FASTQ parsing [Julia,Nim,Crystal,Python,...]

Yes, there has been 0.7, but that has been the latest released version for about 24 hours or so, so many people skipped it in practice :smile:

3 Likes

Didn’t dig through the spreadsheet (I’m on mobile), but I’m guessing your not pulling into from registries other than General, right?

To be clear, I think that would be a totally reasonable choice, but another thing us BioJulia folks need to consider with our separate registry. I don’t know if you’ve ever publicized that spreadsheet before, but it seems likely that there are other “unofficial” community support things that we don’t have access to. @Ward9250 @bicycle1885 something to think about

Actually, now I think about it, we should probably do this for any package that was registered on General but it’s now in the BioJulia registry.

Nope.

2 Likes

It’s interesting to see how fast the Julia code is (after tuning), and while also how condensed it is (could be a bit more), and same for Crystal language code.

I see “430 lines (382 sloc)” for a file, but it’s really 308 sloc, smaller than Crystal (at least for this one file, and corresponding one). https://github.com/lh3/biofast/blob/master/lib/Klib.jl

It’s good to have in mind that “sloc” is misrepresenting, counting triple quoted comments with (yes, at least one comment had code example).

C, and Crystal (nearly identical speed) are the languages to beat (on that metric too), and we’re close.

Author here. This is a benchmark. We would like to compare different implementations of the same algorithm. Klib.jl is the most relevant here. I want to thank @bicycle1885 again for improving this Julia implementation. I really appreciate.

My main concern with Julia is its silent performance traps: correct results but poor performance. The poor performance here is caused by small details like length vs sizeof and type conversion. These are not apparent to new users. I guess even experienced devs may need to pay extra attention. Currently, Julia is still slow on bedcov. I probably messed up typing somewhere, but I am not sure where, given that the code is just doing array accesses.

I am more impressed by Crystal to be honest. My first working Crystal implementation is close to the reported speed. The Nim FASTQ parser is also fast on my first try. A good language should set few traps.

As to other points in this thread:

  • This is not a zlib benchmark. The reported number now comes from the system zlib, thanks to @bicycle1885’s suggestion. I can understand why Julia ships its own zlib, but this did catch me off guard, again.

  • The klib fastx parser is more flexible. It parses multi-line fasta and fastq at the same time. You don’t need to tell klib the input format. At least this is an important feature to me. With a regex parser, Fastx.jl can’t do it. In addition, klib has no Julia dependencies and is more than 10 times faster to compile. Due to the long startup of Julia, lightweight libraries are more appreciated.

  • Productivity: on these two tasks, Crystal and Nim are also expressive. I spent less time on Crystal and Nim to achieve descent performance. I still have performance issues with Julia on bedcov.

  • When you read through a 100Gb gzip’d fastq file, performance matters. 30min vs 1hr is a huge difference. High-performance tools often put fastq reading in a separate thread because it is too slow. Zlib is the main bottleneck here, but parsing time should be minimized as well.

  • Fastx.jl. I installed Fastx.jl following the instruction on its website. I got 1.0.0. It didn’t work with CodecZlib. I manually fixed that.

  • Bio.jl. I didn’t know Fastx.jl. I googled biojulia and went to biojulia.net. The website has no documentations. It doesn’t tell me Fastx.jl is the right way to parse fastq. I thought Biojulia is like all the other Bio* projects with a single module. I ended up with a Bio.jl script first. Also, Bio.jl is often the top google search. For example, you can try “biojulia interval overlap”. By the way, I wanted to implement bedcov with Biojulia. I couldn’t find the right API and gave up.

12 Likes

There is no explanation on the BioJulia as to what packages are available. Maybe put up a “Packages” tab. There are also no links to the documentation.

When you click on “getting started” on the website, you are told to add the registry and then… nothing. If I was trying to install some of these packages I would indeed be confused.

3 Likes

Some comments here on bedcov:
1.Your implementation of Klib.jl on the master branch, which bencov relies on, is not identical to Crystal’s or Nim’s. See the function it_overlap!. In Julia, it uses an empty vector as stack by popping and pushing elements. However in Crystal or Nim, it preallocates a fixed-size staticarray and use a integer to track top of the stack. I don’t know why you use two different approahes here, cause it’s also quite straightforward to translate Nim/Crystal to Julia.
2.Still the function it_overlap!. In Julia, you push the result into an Vector b. But in Crystal/Nim, since they have a builtin-in yield keyword, it doesn’t create a temporary array. In Julia, you need to import some packages to achieve yield's function or manually use Channel and Task.
3.Your comment

Productivity:on these two tasks, Crystal and Nim are also expressive.

are not quite exact. Let’s take the function it_index! for example. In Crystal:

def self.index(a : Array(Interval(SType, DType)))
		a.sort_by!{|x| x.st}
		last, last_i, i = 0, 1, 0
		while i < a.size
			last, last_i = a[i].en, i
			a[i] = Interval.new(a[i].st, a[i].en, a[i].en, a[i].data)
			i += 2
		end
		k = 1

What’s the type of last? You first declare it as 0, which I guess it’s a Integer(I don’t know much about Crystal, only some Ruby). However it’s then be assigned as a[i].en, which is type of SType. In Julia, you correctly avoid this type-unstable by convert 0 to SType:

last::S = 0

You are lucky that you only use Integer in Crystal’s test. It you use float number then it will be either type-unstable or unable to pass type-checking.
Nim is powerful anyway, it can also declare type of last as SType, however, it is tricky here: it doesn’t initialize last as zero. To do these you either need to have some forms of “class methods” (methods defined on classes instead of objects) so you can write S.zero(), or use multiple dispatch. But problems still persists if you use class methods. You are not allowed to extend the methods of class once it gets defined. This is the common problems of OOP.Of course, Python and Ruby are quite flexible and the class es are open, but I don’t think that’s the case in Crystal and Nim.
Crystal is intended to be a static Ruby. One of the application of Ruby is Ruby on rails, a web server framework. Since you don’t really write complex types while developping web server, Crystal doesn’t have a delicate type system(compared to typescript, it has a turing-complete type system to model types in Javascript world). It just means that Crystal’s program will never be as expressive as Julia’s.

4 Likes

A good catch on last in Crystal. This can be fixed by last = a[0].en . I agree Julia typing helps here, though in general I more like the Crystal way: I don’t need to watch out type mismatches to avoid silent performance bugs.

Does Julia natively support static arrays or do I have to use StaticArray.jl? I am hesitant to add a large library just for 60 lines of code especially given that Julia is slow on compilation.

In C++, resizing an allocated vector to 0 reserves its original capacity. I am expecting this behavior when calling resize!(b, 0) . If this is true, I don’t think the performance is affected much given that array b is most often preallocated.

Anyway, if you can improve the performance of bedcov, send me a PR. I prefer not to use an external library, but if doing that helps a lot, I can accept.

EDIT: deleted the old post which is not replying to @ChenNingCong.

2 Likes

It does not.

julia> using BenchmarkTools

julia> push10000times!(x) = foreach(i -> push!(x, i), 1:10_000)
push10000times! (generic function with 1 method)

julia> @benchmark push10000times!(x) setup = (x = Int[])
BenchmarkTools.Trial:
  memory estimate:  256.56 KiB
  allocs estimate:  13
  --------------
  minimum time:     44.920 μs (0.00% GC)
  median time:      47.138 μs (0.00% GC)
  mean time:        49.160 μs (0.77% GC)
  maximum time:     666.649 μs (84.74% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark push10000times!(x) setup = (x = resize!(Vector{Int}(undef,10_000), 0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.499 μs (0.00% GC)
  median time:      32.323 μs (0.00% GC)
  mean time:        34.246 μs (0.00% GC)
  maximum time:     75.547 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark copyto!(x, axes(x,1)) setup = (x = Vector{Int}(undef,10_000))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.319 μs (0.00% GC)
  median time:      1.569 μs (0.00% GC)
  mean time:        1.635 μs (0.00% GC)
  maximum time:     6.534 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

StaticArray is not necessary, you can just use a Vector, though might a little slower than StaticArray. With @inbounds, Julia’s Vector is fast enough.

In C++, resizing an allocated vector to 0 reserves its original capacity. I am expecting this behavior when calling resize!(b, 0) . If this is true, I don’t think the performance is affected much given that array b is most often preallocated.

I think you misunderstood. In main function, you call Klib.it_overlap!(a, st0, en0, b), and b = Vector{Klib.Interval{Int32,Int32}}(), so in Klib.it_overlap!(a, st0, en0, b) you push a large number of Interval into Vector. However, In Nim&Crystal, you use a generator, which means that you don’t have to save all temporary results.
Also using resize! is apparently an anti-pattern. Why you resize! an empty vector, it makes no sense. I think what you want is:

function it_overlap(a::Vector{Interval{S,T}}, st::S, en::S) where{S,T}
it_overlap!(a,st,en,Interval{S,T}[])#collect all elements into an empty array
return b
end
function it_overlap!(a::Vector{Interval{S,T}}, st::S, en::S, b::Vector{Interval{S,T}}) where {S,T}
#push elements into b
end

That is, you have two versions of function. One is modified version, another pass empty vector into the modified version. Of course, this actually has little impacts on performance.

In C, it is a common practice to preallocate a buffer and reuse it to reduce the frequency of heap allocations. The Julia one is supposed to do this. Note that b is declared outside all loops. If resize!(b, 0) doesn’t shrink the capacity of b, GC should be triggered rarely.

I have also tried to use Channel following this blog post. On a small output is it several times slower than my current implementation – Channel is a bad substitution of generator.

Anyway, sending me a faster PR is the best way to prove your point. Let numbers speak.

4 Likes

Ok, so let’s turn this into something or something(s) actionable for BioJulia?

I can think and start with 2:

Point taken on the getting started website section. It is not finished. I didn’t want to maintain a list of packages on there manually, but now we have the bio-specific registry I can either try to use DocumentationGenerator or cook something else up that builds a package listing for the website.

Fixing the bug I think you mentioned in FASTX would be number two.

3 Likes

The Fastx.jl bug has been fixed in later versions. However, when I installed Fastx.jl, I only got 1.0.0.

I would try best to make Bio.jl invisible. Extreme actions include archiving github and taking down the documentation website (put a PDF there for remaining users instead). Less extreme actions could be to put a deprecation banner on all Bio.jl related pages and ask google not to index Bio.jl pages. At the very least, write something like “this repo is deprecated” in the github description of Bio.jl in addition to README, and mention Bio.jl is deprecated at Biojulia.net.

As an outsider, I am not sure why you maintain a separate registry. This reduces the visibility of Biojulia packages and makes it more difficult to install them. It may be better to move mature packages to the main registry.

3 Likes

When the package manager - in it’s current state was released, with it’s features for maintaining multiple registries and the like there came talk of having organisation and topic specific registries. I thought it was really cool and jumped to be an early adopter as I could - but at the same time, searching and discovering julia packages in General became much easier and pleasant experience. Now it is how it is. It might be easier after all to just have general absorb it all, and have good tagging and instructions on the website on how to search the package list for biojulia packages.

5 Likes

I am a fan of this proposal.

I’ll be honest, I’ve always been sceptical of the BioJulia registry, and never saw enough benefit to move BioStructures.jl to it. Not that I think it’s bad, I just never saw the big advantage.

I think putting more on biojulia.net is essential, e.g. a list of packages and a couple of examples of how they can be used interoperably as part of an ecosystem.

I am also aware that this kind of work has fallen to a small number of people in the past, most notably Ben, so am happy to help as appropriate.

We have great tools here but the tools are better than the docs, discoverability and overall user experience.

4 Likes

Ok then, I’ll begin the transition.

Yeah it might be good to come up with notebooks and tutorials showing how you can piece together the ecosystem.

We should also perhaps have clear instructions on how to avoid the more common julia pitfalls, if we can stop new users from getting that holding the phone wrong response in the first place it might make a better first impression.

10 Likes

For what it’s worth, it was never the intention for there to be multiple public registries of open source projects. The purpose of supporting multiple registries was to allow fully private (not available to the public) and semi-private registries (available but meant for use within an organization/group). Having multiple public registries of open source projects just hurts their discoverability, IMO, as seems to be the case here. The requirements on the General registry are extremely minimal. Of course we can’t force people to not have separate registries, but it doesn’t seem like a good thing to me.

7 Likes

Ok, the process has begun with BioGenerics, BioSymbols, and FormatSpecimens but it’s 3am so I’m gonna crash. In the comming days, if anyone observing General, with the power to merge anything, notices that CI has lit green something that I’m migrating, that is sat waiting for the 3 day wait for new packages, that would help make it a lot less painful.

3 Likes

I’ll keep an eye on the registry.

2 Likes