I’d like to make a new package for BioJulia. I’ve made a crude prototype, but I’m still unsure about the design and API and would like some feedback, and maybe pushback if my current design has problems. I think I asked about this on Slack maybe half a year ago, but it’s lost now, of course.
Anyway, the package should provide functionality for working with mutations in biological sequences. I would like to programmatically answer these two questions:
- How does sequence A and B differ?
- Is the mutation
A297Gpresent in sequence A?
These questions can already be answered by aligning sequences and staring at the result, but that is not very amenable to automation. What I want is basically a way to talk about differences between two biosequences.
Here’s the design I have in mind:
- A variation is an
Edit{S <: BioSequence, T <: BioSymbol}. AnEditcontains a reference position, and aUnion{Substitution{T}, Insertion{S}, Deletion}. This union is efficient due to union splitting. - The above implies that the package only deals with “edits”, i.e. those three operations. That annoys me, but I can’t think about how to make it more generic without making it terribly slow.
-
Substitution{T}just contains the sequence base,Deletionthe length of the deletion, andInsertion{S}the whole inserted sequence. - A reference sequence
::Swith anEdit{S, T}is aVariation{S, T}. Because anEditis semantically meaningless without specifying a reference position,Editwill be considered an internal type, and the user is expected to work withVariations. The idea behind that is that by making it difficult to constructEdits without a reference sequence, it’s hard to mess up the reference positions, because you never have to manually manage it. - For convenience, a
Variant{S, T}is a sequence::Swith a vector ofEdit{S, T}. -
VariationandVarianttypes are validating. That is, you cannot e.g. construct aVariantwith a 10 nt DNA sequence and a mutation at position 11.
Here’s what you should be able to do:
- Construct a
Variantfrom two biosequences, one being the query and the other the reference - “Project” a
Variantto a new reference. Under the hood, the two references are aligned and new reference positions are calculated. - Ask whether a
Variationis contained in aVariant: “Does this sequence have this mutation?”. - Reconstruct the query, and/or an alignment from a
Variantor aVariation
The actual use case for me is that for my work, I need to be able to automatically scan a large number of sequences for known mutations (and probably also do other shenanigans with the unknown mutations).
If you have more use cases or can think of something obvious that is missing, please tell me! Also, should this functionality be folded into GeneticVariation.jl? That is, is this out-of-scope, or would it be a welcome contribution?
Q&A
-
Why not just use CIGARs? They’re standardized, after all
I was surprised to find that CIGARs are actually not standardized. Futher, they are alignment-centric and has operations like “clip”, which does not make sense in a biological/mutational context. Also, due to their alignment-centric nature, they have a “match/mismatch” operation, which does not distinguish the two, a must-have for working with alignments. Last, it should be possible to reconstruct a sequence given only the reference and the mutations, but that is not possible with CIGARs. -
Shouldn’t it be extendable? What if someone else wants to look at differences in methylation patterns, not just sequence edits?
Yeah, maybe it should be. Do you have a good idea how to do that?