SequenceVariation: Brainstorming for new package

I’d like to make a new package for BioJulia. I’ve made a crude prototype, but I’m still unsure about the design and API and would like some feedback, and maybe pushback if my current design has problems. I think I asked about this on Slack maybe half a year ago, but it’s lost now, of course.

Anyway, the package should provide functionality for working with mutations in biological sequences. I would like to programmatically answer these two questions:

  • How does sequence A and B differ?
  • Is the mutation A297G present in sequence A?

These questions can already be answered by aligning sequences and staring at the result, but that is not very amenable to automation. What I want is basically a way to talk about differences between two biosequences.

Here’s the design I have in mind:

  • A variation is an Edit{S <: BioSequence, T <: BioSymbol}. An Edit contains a reference position, and a Union{Substitution{T}, Insertion{S}, Deletion}. This union is efficient due to union splitting.
  • The above implies that the package only deals with “edits”, i.e. those three operations. That annoys me, but I can’t think about how to make it more generic without making it terribly slow.
  • Substitution{T} just contains the sequence base, Deletion the length of the deletion, and Insertion{S} the whole inserted sequence.
  • A reference sequence ::S with an Edit{S, T} is a Variation{S, T}. Because an Edit is semantically meaningless without specifying a reference position, Edit will be considered an internal type, and the user is expected to work with Variations. The idea behind that is that by making it difficult to construct Edits without a reference sequence, it’s hard to mess up the reference positions, because you never have to manually manage it.
  • For convenience, a Variant{S, T} is a sequence ::S with a vector of Edit{S, T}.
  • Variation and Variant types are validating. That is, you cannot e.g. construct a Variant with a 10 nt DNA sequence and a mutation at position 11.

Here’s what you should be able to do:

  • Construct a Variant from two biosequences, one being the query and the other the reference
  • “Project” a Variant to a new reference. Under the hood, the two references are aligned and new reference positions are calculated.
  • Ask whether a Variation is contained in a Variant: “Does this sequence have this mutation?”.
  • Reconstruct the query, and/or an alignment from a Variant or a Variation

The actual use case for me is that for my work, I need to be able to automatically scan a large number of sequences for known mutations (and probably also do other shenanigans with the unknown mutations).

If you have more use cases or can think of something obvious that is missing, please tell me! Also, should this functionality be folded into GeneticVariation.jl? That is, is this out-of-scope, or would it be a welcome contribution?

Q&A

  • Why not just use CIGARs? They’re standardized, after all
    I was surprised to find that CIGARs are actually not standardized. Futher, they are alignment-centric and has operations like “clip”, which does not make sense in a biological/mutational context. Also, due to their alignment-centric nature, they have a “match/mismatch” operation, which does not distinguish the two, a must-have for working with alignments. Last, it should be possible to reconstruct a sequence given only the reference and the mutations, but that is not possible with CIGARs.

  • Shouldn’t it be extendable? What if someone else wants to look at differences in methylation patterns, not just sequence edits?
    Yeah, maybe it should be. Do you have a good idea how to do that?

2 Likes

This seems good to me. A few thoughts, in no particular order:

  1. Does a Deletion have a length (eg, if I’m missing 3 bases, is that one Deletion or 3)? This matters for things like cost models in an alignment, but it sounds like that’s not the point of this package, so keeping all edits length 1 makes sense to me (it just means the biological interpretation isn’t stored in the data structure)
  2. Do you intend to have a low-cost way of switching which sequence is the reference? Imagine I start with seqX = "ATTGCT" as the reference. Then I add seqY = "ATTCTT" - two substitutions. Then I add seqZ = "ATTATT, which is 2 substitutions from seqX but only 1 from seqY. It would most parsimonious to switch to seqY as the reference, but I’m not good enough at computer science to know how costly an operation like this would be. It seems like, with 2 sequences, it should be essentially free to swap which is the reference, but as the number of sequences increases, it could get complicated. But I would think it could be so doable without recalculating everything.
  3. Some of this send like it will depend on BioAlignments.jl as well - is that already a dependency of GeneticVariation.jl?

Thanks for your response!

  1. Yep, right now Deletion has a length. I’m not settled on any particular underlying implementation, it just seemed easier to have a single deletion of length N rather than N deletions of length 1.
  2. Yes, as efficient as I can figure out how to do it. Switching from one reference to another, I imagine it will be very easy. With N references of length M, I think it needs O(MN) time and O(M) space to find the optimal reference and pick that. That can be done by aligning the sequence to each of the potential references and picking the best candidate.
  3. I do intend to make BioAlignments a dependency. Finding the optimal reference, and constructing a Variant from a seq and a reference will be done on the basis of an alignment, I imagine. One can then choose to use the different alignment algorithms BioAlignments provide. Perhaps, if I can figure out how to, I can improve the edit distance alignment to be more efficient (for example, not allocating the full tracing matrix).
1 Like

I love this idea. Your use case sounds very similar to mine. Some ‘laundry list’ items that I would like to see in such a package

  1. Ability to create a Variant from an already existing AlignedSequence
    This is theoretically possible, but non-standard CIGARS (as you mentioned) might throw a kink in it.
  2. Ability to create Variation objects from a VCF/BCF file
    Lends itself to the notion that this functionality should go into GeneticVariation.jl. Would, of course, require a reference sequence.

Question: what happens with query sequences much shorter than the reference? I’m thinking about reads contained in a SAM/BAM file, and would like to query if each of them have a Variant present, not just a full assembly/consensus sequence.

I think this functionality does fit within the scope of GeneticVariation.jl. I can easily see a function like gene_frequencies being extended with a method like gene_frequencies(::Variant, ::AbstractArray{Sequence}). (Ok, please don’t code it quite like that, but you get the point.) My only qualm with wrapping it into GeneticVariation.jl right now is how outdated GeneticVariation.jl is, to the point where I haven’t been able to install it alongside XAM.jl (issue here), and have never used it.

1 Like