lhe/Biofast benchmark | FASTQ parsing [Julia,Nim,Crystal,Python,...]

It does not.

julia> using BenchmarkTools

julia> push10000times!(x) = foreach(i -> push!(x, i), 1:10_000)
push10000times! (generic function with 1 method)

julia> @benchmark push10000times!(x) setup = (x = Int[])
BenchmarkTools.Trial:
  memory estimate:  256.56 KiB
  allocs estimate:  13
  --------------
  minimum time:     44.920 μs (0.00% GC)
  median time:      47.138 μs (0.00% GC)
  mean time:        49.160 μs (0.77% GC)
  maximum time:     666.649 μs (84.74% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark push10000times!(x) setup = (x = resize!(Vector{Int}(undef,10_000), 0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.499 μs (0.00% GC)
  median time:      32.323 μs (0.00% GC)
  mean time:        34.246 μs (0.00% GC)
  maximum time:     75.547 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark copyto!(x, axes(x,1)) setup = (x = Vector{Int}(undef,10_000))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.319 μs (0.00% GC)
  median time:      1.569 μs (0.00% GC)
  mean time:        1.635 μs (0.00% GC)
  maximum time:     6.534 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

A StaticArray is not necessary; you can just use a Vector, though it might be a little slower than a StaticArray. With @inbounds, Julia’s Vector is fast enough.
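
As a minimal sketch of the @inbounds point, assuming a plain Vector{Int} and a made-up function `sum_squares` (neither is taken from the benchmark code):

```julia
# Hypothetical example: a tight loop over a plain Vector.
# @inbounds skips bounds checks; that is only safe here because
# eachindex(v) yields nothing but valid indices of v.
function sum_squares(v::Vector{Int})
    s = 0
    @inbounds for i in eachindex(v)
        s += v[i] * v[i]
    end
    return s
end

println(sum_squares(collect(1:10)))   # 385
```

Whether this closes the gap to a StaticArray depends on the workload, so it is worth timing with @benchmark before relying on it.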

In C++, resizing an allocated vector to 0 preserves its original capacity. I am expecting the same behavior when calling resize!(b, 0). If this is true, I don’t think performance is affected much, given that array b is most often preallocated.

I think you misunderstood. In the main function, you call Klib.it_overlap!(a, st0, en0, b) with b = Vector{Klib.Interval{Int32,Int32}}(), so Klib.it_overlap!(a, st0, en0, b) pushes a large number of Intervals into the Vector. In Nim and Crystal, however, you use a generator, which means you don’t have to store all the temporary results.
Also, using resize! this way is apparently an anti-pattern. Why would you resize! an empty vector? It makes no sense. I think what you want is:

function it_overlap(a::Vector{Interval{S,T}}, st::S, en::S) where {S,T}
    b = Interval{S,T}[]          # start from an empty array
    it_overlap!(a, st, en, b)    # collect all overlapping elements into b
    return b
end
function it_overlap!(a::Vector{Interval{S,T}}, st::S, en::S, b::Vector{Interval{S,T}}) where {S,T}
    # push elements into b
end

That is, you have two versions of the function: a mutating version, and a non-mutating one that passes an empty vector to the mutating version. Of course, this has little impact on performance.

In C, it is a common practice to preallocate a buffer and reuse it to reduce the frequency of heap allocations. The Julia one is supposed to do this. Note that b is declared outside all loops. If resize!(b, 0) doesn’t shrink the capacity of b, GC should be triggered rarely.
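
A minimal sketch of that buffer-reuse pattern, on a toy problem rather than the interval-overlap code. The names `select_range!` and `count_hits` are made up for illustration, and the sketch assumes that resize!(buf, 0) keeps the capacity, which is what the zero-allocation benchmark earlier in the thread suggests:

```julia
# Hypothetical helper: write the elements of xs in [lo, hi] into the caller's buffer.
function select_range!(buf::Vector{Int}, xs::Vector{Int}, lo::Int, hi::Int)
    resize!(buf, 0)                      # length -> 0; old contents discarded
    for x in xs
        lo <= x <= hi && push!(buf, x)
    end
    return buf
end

# Assumes xs is sorted and non-empty.
function count_hits(xs::Vector{Int}, width::Int)
    buf = Int[]                          # declared once, outside the loop
    sizehint!(buf, length(xs))           # reserve room up front
    total = 0
    for lo in first(xs):width:last(xs)
        select_range!(buf, xs, lo, lo + width - 1)
        total += length(buf)
    end
    return total
end

println(count_hits(collect(1:1000), 100))   # 1000
```

The same buffer object is reused for every query, so the heap is touched only when a result outgrows the reserved capacity.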

I have also tried to use Channel following this blog post. On a small output it is several times slower than my current implementation. Channel is a poor substitute for a generator.
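
For readers who have not used the two styles, here is a minimal sketch on a toy problem (yielding even numbers). The function names and the buffer size of 32 are arbitrary choices for illustration and not the benchmark code. No timings are claimed here, but each value passed through a Channel involves a hand-off between tasks, which is a plausible source of the overhead described above:

```julia
# Version 1: push results into a caller-supplied vector.
function evens!(out::Vector{Int}, xs)
    resize!(out, 0)
    for x in xs
        iseven(x) && push!(out, x)
    end
    return out
end

# Version 2: generator-like Channel; each put! hands one value to the consumer.
function evens_channel(xs)
    return Channel{Int}(32) do ch      # 32 is an arbitrary buffer size
        for x in xs
            iseven(x) && put!(ch, x)
        end
    end
end

xs = 1:20
a = evens!(Int[], xs)
b = Int[]
for x in evens_channel(xs)             # iteration ends when the producer task finishes
    push!(b, x)
end
println(a == b)   # true
```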

Anyway, sending me a faster PR is the best way to prove your point. Let numbers speak.

OK, so let’s turn this into one or more actionable items for BioJulia?

I can think of two to start with:

Point taken on the getting started section of the website. It is not finished. I didn’t want to maintain a list of packages on there manually, but now that we have the bio-specific registry, I can either try to use DocumentationGenerator or cook up something else that builds a package listing for the website.

Fixing the bug I think you mentioned in FASTX would be number two.

The FASTX.jl bug has been fixed in later versions. However, when I installed FASTX.jl, I only got 1.0.0.

I would try my best to make Bio.jl invisible. Extreme actions include archiving the GitHub repo and taking down the documentation website (put a PDF there for remaining users instead). Less extreme actions could be to put a deprecation banner on all Bio.jl related pages and ask Google not to index Bio.jl pages. At the very least, write something like “this repo is deprecated” in the GitHub description of Bio.jl in addition to the README, and mention that Bio.jl is deprecated at biojulia.net.

As an outsider, I am not sure why you maintain a separate registry. This reduces the visibility of BioJulia packages and makes it more difficult to install them. It may be better to move mature packages to the main registry.

When the package manager was released in its current state, with its features for maintaining multiple registries and the like, there was talk of having organisation-specific and topic-specific registries. I thought it was really cool and jumped to be an early adopter as soon as I could. At the same time, though, searching and discovering Julia packages in General became a much easier and more pleasant experience. Now it is how it is. It might be easier after all to just have General absorb it all, and have good tagging and instructions on the website on how to search the package list for BioJulia packages.

I am a fan of this proposal.

I’ll be honest, I’ve always been sceptical of the BioJulia registry, and never saw enough benefit to move BioStructures.jl to it. Not that I think it’s bad, I just never saw the big advantage.

I think putting more on biojulia.net is essential, e.g. a list of packages and a couple of examples of how they can be used interoperably as part of an ecosystem.

I am also aware that this kind of work has fallen to a small number of people in the past, most notably Ben, so am happy to help as appropriate.

We have great tools here but the tools are better than the docs, discoverability and overall user experience.

Ok then, I’ll begin the transition.

Yeah it might be good to come up with notebooks and tutorials showing how you can piece together the ecosystem.

We should also perhaps have clear instructions on how to avoid the more common Julia pitfalls. If we can stop new users from getting that “you’re holding the phone wrong” response in the first place, it might make a better first impression.

For what it’s worth, it was never the intention for there to be multiple public registries of open source projects. The purpose of supporting multiple registries was to allow fully private (not available to the public) and semi-private registries (available but meant for use within an organization/group). Having multiple public registries of open source projects just hurts their discoverability, IMO, as seems to be the case here. The requirements on the General registry are extremely minimal. Of course we can’t force people to not have separate registries, but it doesn’t seem like a good thing to me.

OK, the process has begun with BioGenerics, BioSymbols, and FormatSpecimens, but it’s 3am so I’m gonna crash. In the coming days, if anyone watching General who has the power to merge notices that CI is green on something I’m migrating and it is just sitting out the 3-day waiting period for new packages, merging it would make this a lot less painful.

I’ll keep an eye on the registry.

Maybe I’m dense or there are complications related to registry CI but wouldn’t it be easier and better to do a wholesale merge of the BioJulia registry into General in a single PR? Assuming there are no name clashes, the package directories (with Package.toml, Versions.toml, Deps.toml, and Compat.toml) can just be copied in. The entries in the [packages] section of Registry.toml would need to be concatenated and sorted, which is easy enough. As a major benefit the historical versions of the packages would be retained.
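
To make the Registry.toml step concrete, here is a minimal sketch using the TOML standard library. The registry contents, UUIDs, and package names below are placeholders, and `merge_package_tables` is a made-up helper. It covers only the [packages] table and does not handle the package directories or packages whose versions span both registries:

```julia
using TOML

# Made-up registry contents; the UUIDs and package names are placeholders.
general = TOML.parse("""
name = "General"

[packages]
00000000-0000-0000-0000-000000000001 = { name = "Alpha", path = "A/Alpha" }
00000000-0000-0000-0000-000000000003 = { name = "Gamma", path = "G/Gamma" }
""")

bio = TOML.parse("""
name = "BioJuliaRegistry"

[packages]
00000000-0000-0000-0000-000000000002 = { name = "Beta", path = "B/Beta" }
""")

# Combine the [packages] tables, refusing to continue on a name clash.
function merge_package_tables(dst::Dict, src::Dict)
    merged = copy(dst["packages"])
    taken = Set(entry["name"] for entry in values(merged))
    for (uuid, entry) in src["packages"]
        haskey(merged, uuid) && continue            # already registered under this UUID
        entry["name"] in taken && error("name clash: ", entry["name"])
        merged[uuid] = entry
        push!(taken, entry["name"])
    end
    return merged
end

general["packages"] = merge_package_tables(general, bio)
TOML.print(general; sorted = true)    # keys are written in sorted order
```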

Edit: Work in progress merge support in LocalRegistry: https://github.com/GunnarFarneback/LocalRegistry.jl/pull/14

There are packages that have different versions that span both registries, and I don’t trust myself to manually sort that out in a PR without human error creeping in. But for the packages with no previous entries in General it would be easier.

Manually sounds barbaric but let there be tools. Try this and see if you like the result in /tmp/General (adjust the paths as necessary if you’re not running Linux):

using Pkg
pkg"add LocalRegistry#merge"
using LocalRegistry
run(`git clone https://github.com/JuliaRegistries/General.git /tmp/General`)
run(`git clone https://github.com/BioJulia/BioJuliaRegistry.git /tmp/BioJuliaRegistry`)
LocalRegistry.merge("/tmp/General", "/tmp/BioJuliaRegistry", merge_packages = true)

I tried this this morning - I totally did not clock the merge_packages option. :facepalm:

It probably didn’t exist at the time. :slight_smile:

OMG that merge function has saved my sanity. Thank you!

Ok, everything in the BioJuliaRegistry has been merged into General. I’m going to update the website, but there may be references to the registry in READMEs and old doc builds whilst we get around to hunting them out and correcting them. In the meantime it would help if any users could remove the BioJuliaRegistry and try and do their thing with just General and see if it breaks - it shouldn’t but you know how it is. Soon-ish, say end of bank holiday, I will archive the BioJuliaRegistry.
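
For anyone wanting to try that, a minimal sketch using Pkg’s registry functions. It assumes the registry was installed under the name "BioJuliaRegistry", and `use_general_only` is just a made-up wrapper so that nothing runs until you call it:

```julia
using Pkg

# Assumption: the registry was added under the name "BioJuliaRegistry".
function use_general_only()
    Pkg.Registry.rm("BioJuliaRegistry")   # drop the BioJulia registry
    Pkg.Registry.update()                 # refresh the remaining registries (e.g. General)
    Pkg.Registry.status()                 # list the registries still installed
end
```

Calling use_general_only() and then running Pkg.update() in an existing project should show whether everything still resolves from General alone.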

New Getting Started instructions are available on the website, as is a new post explaining Bio.jl is deprecated, and which packages should be used instead.

What should I do to migrate from BioJuliaRegistry to General?