I’d like to make a new package for BioJulia. I’ve made a crude prototype, but I’m still unsure about the design and API and would like some feedback, and maybe pushback if my current design has problems. I think I asked about this on Slack maybe half a year ago, but it’s lost now, of course.
Anyway, the package should provide functionality for working with mutations in biological sequences. I would like to programmatically answer these two questions:
- How does sequence A and B differ?
- Is the mutation
A297G
present in sequence A?
These questions can already be answered by aligning sequences and staring at the result, but that is not very amenable to automation. What I want is basically a way to talk about differences between two biosequences.
Here’s the design I have in mind:
- A variation is an
Edit{S <: BioSequence, T <: BioSymbol}
. AnEdit
contains a reference position, and aUnion{Substitution{T}, Insertion{S}, Deletion}
. This union is efficient due to union splitting. - The above implies that the package only deals with “edits”, i.e. those three operations. That annoys me, but I can’t think about how to make it more generic without making it terribly slow.
-
Substitution{T}
just contains the sequence base,Deletion
the length of the deletion, andInsertion{S}
the whole inserted sequence. - A reference sequence
::S
with anEdit{S, T}
is aVariation{S, T}
. Because anEdit
is semantically meaningless without specifying a reference position,Edit
will be considered an internal type, and the user is expected to work withVariations
. The idea behind that is that by making it difficult to constructEdit
s without a reference sequence, it’s hard to mess up the reference positions, because you never have to manually manage it. - For convenience, a
Variant{S, T}
is a sequence::S
with a vector ofEdit{S, T}
. -
Variation
andVariant
types are validating. That is, you cannot e.g. construct aVariant
with a 10 nt DNA sequence and a mutation at position 11.
Here’s what you should be able to do:
- Construct a
Variant
from two biosequences, one being the query and the other the reference - “Project” a
Variant
to a new reference. Under the hood, the two references are aligned and new reference positions are calculated. - Ask whether a
Variation
is contained in aVariant
: “Does this sequence have this mutation?”. - Reconstruct the query, and/or an alignment from a
Variant
or aVariation
The actual use case for me is that for my work, I need to be able to automatically scan a large number of sequences for known mutations (and probably also do other shenanigans with the unknown mutations).
If you have more use cases or can think of something obvious that is missing, please tell me! Also, should this functionality be folded into GeneticVariation.jl? That is, is this out-of-scope, or would it be a welcome contribution?
Q&A
-
Why not just use CIGARs? They’re standardized, after all
I was surprised to find that CIGARs are actually not standardized. Futher, they are alignment-centric and has operations like “clip”, which does not make sense in a biological/mutational context. Also, due to their alignment-centric nature, they have a “match/mismatch” operation, which does not distinguish the two, a must-have for working with alignments. Last, it should be possible to reconstruct a sequence given only the reference and the mutations, but that is not possible with CIGARs. -
Shouldn’t it be extendable? What if someone else wants to look at differences in methylation patterns, not just sequence edits?
Yeah, maybe it should be. Do you have a good idea how to do that?