Request for comments: Upcoming Kmers.jl version 1.0

Dear BioJulia users and stakeholders,

I’m pleased to announce a preview of the first stable release of Kmers.jl, namely version 1.0.
The release is essentially done and just needs some more testing and polishing, but before the release, I’d like feedback from potential users and BioJulia stakeholders.
This includes the original author of Kmers.jl, @Ward9250, as well as @kevbonham.

I’m especially interested in feedback regarding broader design issues: the API, issues with user experience of the package, scope, etc.
You can test it out by installing from the branch of this pull request: “Breaking changes for v1” by jakobnissen (Pull Request #35, BioJulia/Kmers.jl on GitHub). Note that you need to dev a compatible version of BioSequences.jl, which is not yet released but can be found on the branch of this pull request: “WIP: Kmers.jl compatibility” by jakobnissen (Pull Request #282, BioJulia/BioSequences.jl on GitHub).

What is Kmers.jl?

Kmers.jl implements the Kmer type. This is a subtype of BioSequence which is immutable, a bitstype, and which has its length as a type parameter. These properties allow Kmers to be stored in registers, which makes for much more efficient code than the generic LongSequence of BioSequences.jl.

As an analogy, if the types BioSequence and LongSequence from BioSequences.jl correspond to AbstractVector and Vector, then Kmer corresponds to SVector from StaticArrays.jl.
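
To make the analogy a bit more concrete, here is a minimal sketch in plain Julia of what "immutable, bitstype, length as a type parameter" can look like. It is not the Kmers.jl implementation or API: ToyMer, toymer and BASES are hypothetical names made up for illustration, the encoding (A=0, C=1, G=2, T=3) is an assumption, and the toy holds at most 32 bases in a single UInt64.

```julia
# Hypothetical toy type, NOT the Kmers.jl API: K DNA bases packed 2 bits each
# into a single UInt64, with the length K stored as a type parameter.
struct ToyMer{K}
    data::UInt64
end

# Assumed encoding for this sketch: A=0, C=1, G=2, T=3.
const BASES = ('A', 'C', 'G', 'T')

# Hypothetical constructor: pack the characters of `s`, first base ending up
# in the most significant occupied bits.
function toymer(s::AbstractString)
    k = length(s)
    k <= 32 || throw(ArgumentError("at most 32 bases fit in a UInt64"))
    data = zero(UInt64)
    for c in s
        code = findfirst(==(c), BASES)
        code === nothing && throw(ArgumentError("invalid base: $c"))
        data = (data << 2) | UInt64(code - 1)
    end
    return ToyMer{k}(data)
end

# The length is read off the type, so no length field needs to be stored.
Base.length(::ToyMer{K}) where {K} = K

# Base number i (1-based), extracted by shifting and masking.
function Base.getindex(m::ToyMer{K}, i::Integer) where {K}
    1 <= i <= K || throw(BoundsError(m, i))
    shift = 2 * (K - i)
    return BASES[Int((m.data >> shift) & 0x03) + 1]
end
```

With these definitions, toymer("TAGC") is a ToyMer{4} for which isbitstype(typeof(m)) holds, so the compiler is free to keep it in a register instead of allocating it on the heap.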

In bioinformatics, a kmer is a polymer molecule (typically DNA, RNA or peptide) consisting of exactly k linked monomers.
In bioinformatics software, kmers are widely used precisely because of the performance characteristics of their implementation.

In Kmers.jl, performance is a top priority, and the methods are microoptimised to the best of my ability. For example, the canonical function below creates a copy of the sequence that is reversed and complemented, and then picks the smaller of the two. It’s fully inlined and branchless:

julia> @code_native debuginfo=:none dump_module=false canonical(mer"UGCUGUA"r)
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        mov     rcx, qword ptr [rdi]
        mov     eax, ecx
        not     eax
        mov     edx, eax
        and     eax, 13107
        shr     edx, 2
        and     edx, 819
        lea     rax, [rdx + 4*rax]
        mov     rdx, rax
        shl     eax, 4
        shr     rdx, 4
        and     eax, 61680
        and     edx, 3087
        or      rax, rdx
        bswap   rax
        shr     rax, 50
        cmp     rcx, rax
        cmovb   rax, rcx
        pop     rbp
        ret
        nop     word ptr cs:[rax + rax]
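
For readers who don't read assembly, the same idea can be sketched in plain Julia. This is an illustration under stated assumptions, not the Kmers.jl source: revcomp_bits and canonical_bits are hypothetical names, and the sketch assumes 2 bits per base with A=0, C=1, G=2, T/U=3, the first base in the most significant occupied bits, and k <= 32. The masks in the listing above (13107 is 0x3333, 61680 is 0xf0f0) suggest the generated code follows a similar pattern.

```julia
# Sketch only, not the Kmers.jl implementation.
# Assumes: 2 bits per base, A=0, C=1, G=2, T/U=3, first base in the most
# significant occupied bits, unused high bits zero, and 0 <= k <= 32.
function revcomp_bits(x::UInt64, k::Integer)
    # With this encoding, complementing a base is flipping both of its bits.
    x = ~x
    # Reverse the order of all 32 two-bit groups in three steps:
    # swap the two bases inside each nibble,
    x = ((x & 0x3333333333333333) << 2) | ((x >> 2) & 0x3333333333333333)
    # swap the two nibbles inside each byte,
    x = ((x & 0x0f0f0f0f0f0f0f0f) << 4) | ((x >> 4) & 0x0f0f0f0f0f0f0f0f)
    # and reverse the bytes.
    x = bswap(x)
    # The k real bases now sit in the top bits; shift them back down, which
    # also discards the complemented padding.
    return x >> (64 - 2k)
end

# The smaller of a kmer and its reverse complement. A single `min` on
# integers can be compiled to a conditional move instead of a branch.
canonical_bits(x::UInt64, k::Integer) = min(x, revcomp_bits(x, k))
```

Under the assumed encoding, UGCUGUA packs to 0x39ec and its reverse complement UACAGCA packs to 0x3124, so canonical_bits(UInt64(0x39ec), 7) picks the latter.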

Design decisions

  • A Kmer is an ordinary BioSequence: it is constructed like, and behaves like, any other BioSequence.
  • Performance is paramount. We’ll sacrifice precompilability, size of generated code, ease of use, and to some extent, latency, for performance.
  • We are not going to implement different variations of the kmer concept, such as minimizers, k-min-mers and skipmers. These can be implemented in terms of the basic Kmer by users, if desired.
  • Kmers.jl is for high-performance code, so it is aimed at somewhat experienced Julia users rather than beginner programmers. For most use cases, BioSequences.jl will be good enough.

A brief history of Kmers.jl

Before version 3 of BioSequences.jl was released about two years ago, BioSequences.jl contained a kmer type. However, we (Sabrina Ward and I) considered the old kmer type insufficient, as it had the following two limitations:

  • It only supported the alphabets DNAAlphabet{2} and RNAAlphabet{2}
  • It only supported lengths up to 32 (with a BigMer type supporting length 64), insufficient for many use cases

To solve these issues, we created two different repositories with experimental implementations, before settling on kmers backed by bits packed into NTuples of integers.

We judged that this new, more complex and specialized implementation should be moved out of BioSequences.jl for the breaking v3.0.0 release. When BioSequences.jl v3.0.0 was released in 2022, Kmers.jl was almost finished and would be released imminently… or so we thought.

In reality, development had stopped, and soon after, the author of Kmers, Sabrina Ward, had to retreat from BioJulia development altogether, which left Kmers.jl stillborn.
This placed BioJulia in the awkward situation of having removed its only kmer implementation in 2022 in a breaking change, with nothing to replace it with.

After having been busy with other BioJulia stuff, I’ve recently found time to finish up Kmers.jl. My plan is to release Kmers.jl 1.0 in a couple of months.

Huge thanks for this effort, and for the detailed history and explanation! I will happily give it a test drive!

Happy New Year :sparkler:

Thank you @jakobnissen and all who put effort into this project. I think this package is a cornerstone of other bioinfo packages. I will take a look and be happy to give some feedback. Best.
