Dear BioJulia users and stakeholders,
I’m pleased to announce a preview of the first stable release of Kmers.jl, namely version 1.0.
The release is essentially done and just needs some more testing and polishing, but before the release, I’d like feedback from potential users and BioJulia stakeholders.
This includes the original author of Kmers.jl, @Ward9250, as well as @kevbonham.
I’m especially interested in feedback regarding broader design issues: the API, the user experience of the package, scope, etc.
You can test it out by installing from the following branch: Breaking changes for v1 by jakobnissen · Pull Request #35 · BioJulia/Kmers.jl · GitHub. Note that you need to dev a compatible version of BioSequences.jl, which is not yet released but can be found on this branch: WIP: Kmers.jl compatibility by jakobnissen · Pull Request #282 · BioJulia/BioSequences.jl · GitHub
What is Kmers.jl?
Kmers.jl implements the Kmer type - this is a subtype of BioSequence which is immutable, a bitstype, and which has its length as a type parameter. These properties allow Kmers to be stored in registers, allowing for much more efficient code than the generic LongSequence of BioSequences.jl.
As an analogy, if the types BioSequence and LongSequence from BioSequences.jl correspond to AbstractVector and Vector, then Kmer corresponds to SVector from StaticArrays.jl.
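To make the three properties above concrete, here is a minimal sketch of a type that is immutable, a bits type, and carries its length as a type parameter. ToyMer, encode_base and the 2-bit encoding (A, C, G, T as 0 to 3) are hypothetical names and choices made up for this illustration. This is not how Kmers.jl is implemented, and the toy only handles unambiguous DNA of length up to 32.

```julia
# Hypothetical toy type for illustration only, NOT the Kmers.jl implementation.
# K symbols are packed 2 bits each into one UInt64, so K must be at most 32.
struct ToyMer{K}
    data::UInt64
end

# Assumed encoding for this sketch: A, C, G, T -> 0, 1, 2, 3.
function encode_base(c::Char)
    i = findfirst(==(c), "ACGT")
    i === nothing && throw(ArgumentError("unsupported symbol: $c"))
    return UInt64(i - 1)
end

# Construct from a string; the first symbol ends up in the highest used bits.
function ToyMer{K}(s::AbstractString) where {K}
    length(s) == K || throw(ArgumentError("expected exactly $K symbols"))
    K <= 32 || throw(ArgumentError("K must be at most 32"))
    data = UInt64(0)
    for c in s
        data = (data << 2) | encode_base(c)
    end
    return ToyMer{K}(data)
end

# The length is known from the type alone, with no field lookup needed.
Base.length(::ToyMer{K}) where {K} = K

m = ToyMer{4}("ACGT")
println(isbitstype(typeof(m)))  # whether the value can live inline or in a register
println(length(m))
```

Because the length is part of the type, ToyMer{4} and ToyMer{5} are different types, and the compiler can specialise code for each length. That is the same trade-off SVector makes relative to Vector.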
In bioinformatics, a kmer is a polymer molecule (typically DNA, RNA or peptide) consisting of exactly k linked molecules.
In bioinformatics software, kmers are broadly used precisely for the performance characteristics of their implementation.
In Kmers.jl, performance is a top priority, and the methods are microoptimised to the best of my ability. For example, the following function creates a copy of the sequence that is reversed and complemented, and then picks the smaller of the two - it’s fully inlined and branchless:
julia> @code_native debuginfo=:none dump_module=false canonical(mer"UGCUGUA"r)
.text
push rbp
mov rbp, rsp
mov rax, qword ptr [r13 + 16]
mov rax, qword ptr [rax + 16]
mov rax, qword ptr [rax]
mov rcx, qword ptr [rdi]
mov eax, ecx
not eax
mov edx, eax
and eax, 13107
shr edx, 2
and edx, 819
lea rax, [rdx + 4*rax]
mov rdx, rax
shl eax, 4
shr rdx, 4
and eax, 61680
and edx, 3087
or rax, rdx
bswap rax
shr rax, 50
cmp rcx, rax
cmovb rax, rcx
pop rbp
ret
nop word ptr cs:[rax + rax]
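For readers who prefer Julia to assembly, the idea can be sketched on a plain UInt64. This is a toy under stated assumptions and not the Kmers.jl source: the names pack_dna, revcomp_2bit and canonical_2bit are hypothetical, the encoding is assumed to be A, C, G, T as 0 to 3 with the first symbol in the highest used bits, and k is assumed to be between 1 and 32. The steps (complement, swap 2-bit groups, swap nibbles, byte swap, shift, take the minimum) have roughly the same shape as the instructions in the listing above.

```julia
# Hypothetical helper: pack an unambiguous DNA string into a UInt64, 2 bits per symbol.
function pack_dna(s::AbstractString)
    length(s) <= 32 || throw(ArgumentError("at most 32 symbols fit in a UInt64"))
    x = UInt64(0)
    for c in s
        i = findfirst(==(c), "ACGT")
        i === nothing && throw(ArgumentError("unsupported symbol: $c"))
        x = (x << 2) | UInt64(i - 1)
    end
    return x
end

# Reverse complement of a k-symbol sequence packed as above, without branches.
function revcomp_2bit(x::UInt64, k::Integer)
    # With A=0, C=1, G=2, T=3, complementing a symbol is flipping its two bits.
    x = ~x
    # Swap neighbouring 2-bit groups, then neighbouring nibbles, then reverse the bytes.
    # Together these reverse the order of all 32 two-bit slots in the word.
    x = ((x >> 2) & 0x3333333333333333) | ((x & 0x3333333333333333) << 2)
    x = ((x >> 4) & 0x0f0f0f0f0f0f0f0f) | ((x & 0x0f0f0f0f0f0f0f0f) << 4)
    x = bswap(x)
    # The k real symbols now sit in the top 2k bits; shift them back down,
    # which also discards the unused slots.
    return x >> (64 - 2k)
end

# The canonical form is the smaller of the sequence and its reverse complement.
canonical_2bit(x::UInt64, k::Integer) = min(x, revcomp_2bit(x, k))

x = pack_dna("GTT")
println(canonical_2bit(x, 3) == pack_dna("AAC"))
```

Since the first symbol occupies the most significant bits, comparing the packed integers is the same as comparing the sequences lexicographically under A < C < G < T, so min is enough to pick the canonical form.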
Design decisions
- A Kmer is an ordinary BioSequence - it is constructed like, and behaves as, any other BioSequence.
- Performance is paramount. We’ll sacrifice precompilability, size of generated code, ease of use, and to some extent, latency, for performance.
- We are not going to implement different variations of the kmer concept, such as minimizers, k-min-mers and skipmers. These can be implemented in terms of the basic Kmer by users, if desired.
- Kmers.jl is for high performance code, meaning it’s aimed at somewhat experienced Julia users rather than beginner programmers. For most use cases, BioSequences.jl will be good enough.
A brief history of Kmers.jl
Before version 3 of BioSequences.jl was released about two years ago, BioSequences.jl contained a kmer type. However, we (Sabrina Ward and I) considered the old kmer type insufficient, as it had the following two limitations:
- It only supported the Alphabets DNAAlphabet{2} and RNAAlphabet{2}
- It only supported lengths up to 32 (with a BigMer type supporting length 64), insufficient for many use cases
To solve these issues, we created two different repositories with experimental implementations, before settling on kmers backed by bits packed into NTuples of integers.
We judged that this new, more complex and specialized implementation should be moved out of BioSequences.jl for the breaking v3.0.0 release. When BioSequences.jl v3.0.0 was released in 2022, Kmers.jl was almost finished and would be released imminently… or so we thought.
In reality, development had stopped, and soon after, the author of Kmers.jl, Sabrina Ward, had to retreat from BioJulia development altogether, which left Kmers.jl stillborn.
This placed BioJulia in the awkward situation of having removed its only kmer implementation in 2022 in a breaking change, with nothing to replace it with.
After having been busy with other BioJulia stuff, I’ve recently found time to finish up Kmers.jl. My plan is to release Kmers.jl 1.0 in a couple of months.