lhe/Biofast benchmark | FASTQ parsing [Julia,Nim,Crystal,Python,...]

jebej · May 22, 2020, 9:28pm

There is no explanation on the BioJulia website as to what packages are available. Maybe put up a “Packages” tab. There are also no links to the documentation.

When you click on “getting started” on the website, you are told to add the registry and then… nothing. If I were trying to install some of these packages I would indeed be confused.

anon56330260 · May 23, 2020, 3:27am

Some comments here on bedcov:
1. Your implementation of Klib.jl on the master branch, which bedcov relies on, is not identical to Crystal’s or Nim’s. See the function it_overlap!. In Julia, it uses an empty vector as a stack by pushing and popping elements. In Crystal and Nim, however, it preallocates a fixed-size static array and uses an integer to track the top of the stack. I don’t know why you use two different approaches here, because it’s also quite straightforward to translate the Nim/Crystal version to Julia.
2. Still on the function it_overlap!. In Julia, you push the results into a Vector b. But Crystal and Nim have a built-in yield keyword, so they don’t create a temporary array. In Julia, you need to import a package to get yield-like behavior, or manually use Channel and Task.
3. Your comment

Productivity: on these two tasks, Crystal and Nim are also expressive.

is not quite exact. Let’s take the function it_index! as an example. In Crystal:

def self.index(a : Array(Interval(SType, DType)))
		a.sort_by!{|x| x.st}
		last, last_i, i = 0, 1, 0
		while i < a.size
			last, last_i = a[i].en, i
			a[i] = Interval.new(a[i].st, a[i].en, a[i].en, a[i].data)
			i += 2
		end
		k = 1

What’s the type of last? You first initialize it to 0, which I guess makes it an integer (I don’t know much about Crystal, only some Ruby). However, it is then assigned a[i].en, which is of type SType. In Julia, you correctly avoid this type instability by converting 0 to SType:

last::S = 0

You are lucky that you only use integers in the Crystal test. If you used floating-point numbers, it would either be type-unstable or fail type checking.
Nim is powerful anyway: it can also declare the type of last as SType. However, it is tricky here, because it doesn’t initialize last to zero. To do that you either need some form of “class methods” (methods defined on classes instead of objects) so you can write S.zero(), or use multiple dispatch. But problems still persist if you use class methods: you are not allowed to extend the methods of a class once it is defined. This is a common problem of OOP. Of course, Python and Ruby are quite flexible and their classes are open, but I don’t think that’s the case in Crystal and Nim.
Crystal is intended to be a static Ruby. One of the applications of Ruby is Ruby on Rails, a web server framework. Since you don’t really write complex types while developing a web server, Crystal doesn’t have a sophisticated type system (compare TypeScript, which has a Turing-complete type system to model types in the JavaScript world). It just means that Crystal programs will never be as expressive as Julia’s.
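The zero-initialization point above can be sketched in Julia as follows. This is only an illustration: max_end and last_en are hypothetical names, not part of Klib.jl, and the sketch assumes non-negative coordinates.

```julia
# Hypothetical helper (not from Klib.jl): largest end coordinate in a list,
# written generically over the coordinate type S.
function max_end(ens::Vector{S}) where {S<:Real}
    last_en = zero(S)            # zero of the element type, so last_en never changes type
    for en in ens
        last_en = max(last_en, en)
    end
    return last_en
end

max_end(Int32[3, 7, 5])   # stays an Int32
max_end([1.5, 0.25])      # stays a Float64
```

Because zero is an ordinary generic function, a user-defined number type can add its own method to it later without touching this code.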

lh3 · May 23, 2020, 4:47am

A good catch on last in Crystal. This can be fixed by last = a[0].en . I agree Julia typing helps here, though in general I like the Crystal way more: I don’t need to watch out for type mismatches to avoid silent performance bugs.

Does Julia natively support static arrays or do I have to use StaticArrays.jl? I am hesitant to add a large library just for 60 lines of code, especially given that Julia is slow at compilation.

In C++, resizing an allocated vector to 0 retains its original capacity. I am expecting the same behavior when calling resize!(b, 0) . If this is true, I don’t think the performance is affected much given that array b is most often preallocated.

Anyway, if you can improve the performance of bedcov, send me a PR. I prefer not to use an external library, but if doing that helps a lot, I can accept.

EDIT: deleted the old post, which was not replying to @anon56330260.

Elrod · May 23, 2020, 5:35am

It does not.

julia> using BenchmarkTools

julia> push10000times!(x) = foreach(i -> push!(x, i), 1:10_000)
push10000times! (generic function with 1 method)

julia> @benchmark push10000times!(x) setup = (x = Int[])
BenchmarkTools.Trial:
  memory estimate:  256.56 KiB
  allocs estimate:  13
  --------------
  minimum time:     44.920 μs (0.00% GC)
  median time:      47.138 μs (0.00% GC)
  mean time:        49.160 μs (0.77% GC)
  maximum time:     666.649 μs (84.74% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark push10000times!(x) setup = (x = resize!(Vector{Int}(undef,10_000), 0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.499 μs (0.00% GC)
  median time:      32.323 μs (0.00% GC)
  mean time:        34.246 μs (0.00% GC)
  maximum time:     75.547 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark copyto!(x, axes(x,1)) setup = (x = Vector{Int}(undef,10_000))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.319 μs (0.00% GC)
  median time:      1.569 μs (0.00% GC)
  mean time:        1.635 μs (0.00% GC)
  maximum time:     6.534 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

anon56330260 · May 23, 2020, 6:05am

StaticArrays is not necessary; you can just use a Vector, though it might be a little slower than a static array. With @inbounds, Julia’s Vector is fast enough.
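A minimal sketch of that idea, assuming a plain Vector used as a fixed-capacity stack with an integer tracking the top. The function count_nodes and the implicit heap-style tree it walks are made up for illustration and are not Klib.jl’s it_overlap!:

```julia
# Hypothetical sketch: depth-first walk over an implicit binary tree whose nodes
# are numbered 1..n (children of node k are 2k and 2k+1), using a preallocated stack.
function count_nodes(n::Int)
    stack = Vector{Int}(undef, 64)   # fixed capacity, allocated once
    top = 0                          # number of live entries; 0 means empty
    count = 0
    if n >= 1
        top += 1
        @inbounds stack[top] = 1     # push the root
    end
    while top > 0
        @inbounds node = stack[top]  # pop
        top -= 1
        count += 1
        for child in (2node, 2node + 1)
            if child <= n
                top += 1
                @inbounds stack[top] = child   # push a child
            end
        end
    end
    return count
end
```

The stack grows by at most one entry per tree level, so 64 slots are assumed to be enough here; that bound is what makes skipping the bounds checks with @inbounds acceptable in this sketch.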

In C++, resizing an allocated vector to 0 retains its original capacity. I am expecting the same behavior when calling resize!(b, 0) . If this is true, I don’t think the performance is affected much given that array b is most often preallocated.

I think you misunderstood. In the main function, you call Klib.it_overlap!(a, st0, en0, b) with b = Vector{Klib.Interval{Int32,Int32}}(), so inside Klib.it_overlap!(a, st0, en0, b) you push a large number of Intervals into the Vector. In Nim and Crystal, however, you use a generator, which means that you don’t have to store all the temporary results.
Also, using resize! here looks like an anti-pattern. Why resize! an empty vector? It makes no sense. I think what you want is:

function it_overlap(a::Vector{Interval{S,T}}, st::S, en::S) where {S,T}
    # collect all overlapping elements into a fresh, empty array
    b = Interval{S,T}[]
    it_overlap!(a, st, en, b)
    return b
end
function it_overlap!(a::Vector{Interval{S,T}}, st::S, en::S, b::Vector{Interval{S,T}}) where {S,T}
    # push elements into b
end

That is, you have two versions of the function: a mutating version, and a non-mutating one that passes an empty vector to the mutating version. Of course, this actually has little impact on performance.

lh3 · May 23, 2020, 1:12pm

In C, it is a common practice to preallocate a buffer and reuse it to reduce the frequency of heap allocations. The Julia one is supposed to do this. Note that b is declared outside all loops. If resize!(b, 0) doesn’t shrink the capacity of b, GC should be triggered rarely.

I have also tried to use Channel following this blog post. On a small output it is several times slower than my current implementation: Channel is a bad substitute for a generator.
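For readers unfamiliar with the two styles being compared, here is a minimal sketch. The names evens_channel and evens! are hypothetical and the task (emitting even numbers) is a stand-in for the interval query; it makes no claim about the actual timings in bedcov.

```julia
# Channel-backed "generator": values are produced in a separate task and handed
# over through the channel, which involves task synchronization.
evens_channel(n) = Channel{Int}(32) do ch
    for i in 1:n
        iseven(i) && put!(ch, i)
    end
end

# Same output pushed into a caller-supplied buffer that is reused across calls.
function evens!(buf::Vector{Int}, n)
    empty!(buf)                      # length becomes 0; the buffer object is reused
    for i in 1:n
        iseven(i) && push!(buf, i)
    end
    return buf
end

buf = Int[]                          # declared once, outside any loop
for n in (10, 4)
    evens!(buf, n)                   # refill the same buffer on every iteration
end
```

Both can be consumed with a for loop; the difference is only in how the values reach the caller.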

Anyway, sending me a faster PR is the best way to prove your point. Let numbers speak.

Ward9250 · May 23, 2020, 7:48pm

Ok, so let’s turn this into something or something(s) actionable for BioJulia?

I can think of two to start with:

Point taken on the getting started website section. It is not finished. I didn’t want to maintain a list of packages on there manually, but now we have the bio-specific registry I can either try to use DocumentationGenerator or cook something else up that builds a package listing for the website.

Fixing the bug I think you mentioned in FASTX would be number two.

lh3 · May 23, 2020, 10:51pm

The Fastx.jl bug has been fixed in later versions. However, when I installed Fastx.jl, I only got 1.0.0.

I would try my best to make Bio.jl invisible. Extreme actions include archiving the GitHub repo and taking down the documentation website (put a PDF there for remaining users instead). Less extreme actions could be to put a deprecation banner on all Bio.jl related pages and ask Google not to index Bio.jl pages. At the very least, write something like “this repo is deprecated” in the GitHub description of Bio.jl in addition to the README, and mention that Bio.jl is deprecated at Biojulia.net.

As an outsider, I am not sure why you maintain a separate registry. This reduces the visibility of Biojulia packages and makes it more difficult to install them. It may be better to move mature packages to the main registry.

Ward9250 · May 23, 2020, 11:07pm

When the package manager was released in its current state, with its features for maintaining multiple registries and the like, there came talk of having organisation- and topic-specific registries. I thought it was really cool and jumped to be an early adopter as soon as I could, but at the same time, searching and discovering Julia packages in General became a much easier and more pleasant experience. Now it is how it is. It might be easier after all to just have General absorb it all, and have good tagging and instructions on the website on how to search the package list for BioJulia packages.

jgreener64 · May 23, 2020, 11:31pm

I am a fan of this proposal.

I’ll be honest, I’ve always been sceptical of the BioJulia registry, and never saw enough benefit to move BioStructures.jl to it. Not that I think it’s bad, I just never saw the big advantage.

I think putting more on biojulia.net is essential, e.g. a list of packages and a couple of examples of how they can be used interoperably as part of an ecosystem.

I am also aware that this kind of work has fallen to a small number of people in the past, most notably Ben, so am happy to help as appropriate.

We have great tools here but the tools are better than the docs, discoverability and overall user experience.

Ward9250 · May 23, 2020, 11:53pm

Ok then, I’ll begin the transition.

Yeah it might be good to come up with notebooks and tutorials showing how you can piece together the ecosystem.

We should also perhaps have clear instructions on how to avoid the more common Julia pitfalls. If we can stop new users from getting that “you’re holding the phone wrong” response in the first place, it might make a better first impression.

StefanKarpinski · May 24, 2020, 1:15am

For what it’s worth, it was never the intention for there to be multiple public registries of open source projects. The purpose of supporting multiple registries was to allow fully private (not available to the public) and semi-private registries (available but meant for use within an organization/group). Having multiple public registries of open source projects just hurts their discoverability, IMO, as seems to be the case here. The requirements on the General registry are extremely minimal. Of course we can’t force people to not have separate registries, but it doesn’t seem like a good thing to me.

Ward9250 · May 24, 2020, 2:10am

Ok, the process has begun with BioGenerics, BioSymbols, and FormatSpecimens, but it’s 3am so I’m gonna crash. In the coming days, if anyone watching General with the power to merge notices that CI has gone green on something I’m migrating that is sitting out the 3-day wait for new packages, merging it would help make this a lot less painful.

dilumaluthge · May 24, 2020, 2:13am

I’ll keep an eye on the registry.

GunnarFarneback · May 24, 2020, 8:50am

Maybe I’m dense or there are complications related to registry CI but wouldn’t it be easier and better to do a wholesale merge of the BioJulia registry into General in a single PR? Assuming there are no name clashes, the package directories (with Package.toml, Versions.toml, Deps.toml, and Compat.toml) can just be copied in. The entries in the [packages] section of Registry.toml would need to be concatenated and sorted, which is easy enough. As a major benefit the historical versions of the packages would be retained.
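A rough sketch of the Registry.toml part of that idea, using Julia’s TOML standard library. The two tiny registries, their UUIDs, and the helper merge_packages are made up for illustration; this is not how LocalRegistry implements it and it ignores the per-package directories entirely.

```julia
using TOML

# Hypothetical stand-ins for the [packages] sections of two Registry.toml files.
general = TOML.parse("""
[packages]
00000000-0000-0000-0000-000000000002 = { name = "Beta", path = "B/Beta" }
""")
bio = TOML.parse("""
[packages]
00000000-0000-0000-0000-000000000001 = { name = "Alpha", path = "A/Alpha" }
""")

# Concatenate the two [packages] tables, refusing to merge if the same package
# name appears under two different UUIDs.
function merge_packages(a::Dict, b::Dict)
    pa, pb = a["packages"], b["packages"]
    names_a = Set(p["name"] for p in values(pa))
    clashes = [p["name"] for (uuid, p) in pb if p["name"] in names_a && !haskey(pa, uuid)]
    isempty(clashes) || error("name clashes: ", join(clashes, ", "))
    return merge(pa, pb)
end

merged = merge_packages(general, bio)
TOML.print(Dict("packages" => merged); sorted = true)   # entries written in sorted key order
```

Packages whose versions span both registries would need their Versions.toml and related files reconciled as well, which this sketch does not attempt.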

Edit: Work in progress merge support in LocalRegistry: [WIP] Support merging of registries. by GunnarFarneback · Pull Request #14 · GunnarFarneback/LocalRegistry.jl · GitHub

Ward9250 · May 24, 2020, 11:12am

There are packages that have different versions that span both registries, and I don’t trust myself to manually sort that out in a PR without human error creeping in. But for the packages with no previous entries in General it would be easier.

GunnarFarneback · May 24, 2020, 2:30pm

Manually sounds barbaric but let there be tools. Try this and see if you like the result in /tmp/General (adjust the paths as necessary if you’re not running Linux):

using Pkg
pkg"add LocalRegistry#merge"
using LocalRegistry
run(`git clone https://github.com/JuliaRegistries/General.git /tmp/General`)
run(`git clone https://github.com/BioJulia/BioJuliaRegistry.git /tmp/BioJuliaRegistry`)
LocalRegistry.merge("/tmp/General", "/tmp/BioJuliaRegistry", merge_packages = true)

Ward9250 · May 24, 2020, 3:39pm

I tried this this morning - I totally did not clock the merge_packages option facepalm.

GunnarFarneback · May 24, 2020, 3:40pm

It probably didn’t exist at the time.

Ward9250 · May 24, 2020, 5:04pm

OMG that merge function has saved my sanity. Thank you!
