This seems good to me. A few thoughts, in no particular order:
- Does a `Deletion` have a length (e.g., if I'm missing 3 bases, is that one `Deletion` or 3)? This matters for things like cost models in an alignment, but it sounds like that's not the point of this package, so keeping all edits length 1 makes sense to me (it just means the biological interpretation isn't stored in the data structure).
- Do you intend to have a low-cost way of switching which sequence is the reference? Imagine I start with `seqX = "ATTGCT"` as the reference. Then I add `seqY = "ATTCTT"`, which is two substitutions. Then I add `seqZ = "ATTATT"`, which is 2 substitutions from `seqX` but only 1 from `seqY`. It would be most parsimonious to switch to `seqY` as the reference, but I'm not good enough at computer science to know how costly an operation like this would be. It seems like, with 2 sequences, it should be essentially free to swap which is the reference, but as the number of sequences increases, it could get complicated. Still, I would think it should be doable without recalculating everything.
- Some of this sounds like it will depend on BioAlignments.jl as well. Is that already a dependency of GeneticVariation.jl?
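
To make the reference-switching question concrete, here is a rough sketch of what I have in mind. It assumes substitution-only, length-1 edits on equal-length ASCII sequences, stored as a position => base map. The names `Edits`, `substitutions`, and `rebase` are made up for illustration and are not part of GeneticVariation.jl or BioAlignments.jl.

```julia
# Hypothetical representation: position => base, listing only the
# positions where a sequence differs from the current reference.
const Edits = Dict{Int,Char}

# Substitutions of `seq` relative to `ref` (equal-length ASCII strings assumed).
function substitutions(ref::AbstractString, seq::AbstractString)
    length(ref) == length(seq) || throw(ArgumentError("sequences must have equal length"))
    edits = Edits()
    for (i, (r, s)) in enumerate(zip(ref, seq))
        r != s && (edits[i] = s)
    end
    return edits
end

# Re-express `edits` (relative to the old reference `ref`) relative to a new
# reference, which is itself described by `newref_edits` against `ref`.
# Only positions that carry an edit in either map are visited.
function rebase(ref::AbstractString, newref_edits::Edits, edits::Edits)
    rebased = Edits()
    for pos in union(keys(newref_edits), keys(edits))
        newref_base = get(newref_edits, pos, ref[pos])
        seq_base = get(edits, pos, ref[pos])
        newref_base != seq_base && (rebased[pos] = seq_base)
    end
    return rebased
end

seqX, seqY, seqZ = "ATTGCT", "ATTCTT", "ATTATT"

y_vs_x = substitutions(seqX, seqY)       # 2 substitutions (positions 4 and 5)
z_vs_x = substitutions(seqX, seqZ)       # 2 substitutions (positions 4 and 5)

# Switch the reference from seqX to seqY without rescanning whole sequences.
z_vs_y = rebase(seqX, y_vs_x, z_vs_x)    # 1 substitution (position 4)
x_vs_y = rebase(seqX, y_vs_x, Edits())   # the old reference becomes an edited sequence
```

With `seqX` as the reference this stores 2 + 2 = 4 edits; with `seqY` it stores 2 + 1 = 3. In this sketch the cost of switching scales with the number of stored edits rather than with sequence length, but insertions and deletions shift coordinates, so I would expect the real thing to be more involved.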