Efficient memory representation of sequences with few mutations

I have a dataset which consists of a wild-type sequence and many (\sim10^7 or more) mutated sequences which differ from the wild-type in only a few positions. I want to represent this efficiently in memory, by storing the wild-type sequence only once and then storing only the mutations of each mutated sequence.

Does BioSequences offer some facility for this kind of representation?
Otherwise, any advice on how I might approach this?

A wrapper <: AbstractVector type that stores the indexes and values for the differences, and has a Base.getindex that looks up if the index is in the changed set or not, and returns the value accordingly.

The best representation depends on details, eg a sorted vector, or a Dict of indexes.

1 Like

I thought of something like this.

Change = NamedTuple{(:pos,:x), Tuple{Int,Int}}
struct DiffSeq{L,D}
	s0::NTuple{L,Int}
	dif::NTuple{D, Change}
end

And then defined the iterator interface accordingly.
Here s0 is the wild-type sequence, and it is shared across may DiffSeq instances. However this only makes sense if s0 is only allocated in memory once.

The documentation says that (...) in some cases the compiler is able to avoid allocating immutable objects entirely. In this case s0 is an NTuple which is immutable. But how can I be sure it is not being allocated many times?

I don’t think you can. Immutable types in Julia are defined by their value, and not by their memory address as mutable types are. If s0 was a vector and not a tuple, then that s0 vector is passed by reference (defined by it’s memory address), and so if you pass a vector to many DiffSeqs as their s0, you know all those DiffSeqs will share that very same vector. Which you can’t know AFAIK if you use a NTuple.

You could also check out BioAlignments.jl, which has a datastruture which represents one sequence (an aligned sequence) as a reference to some reference sequence, and a set of edit operations (match, mismatch, indel and so on).

3 Likes

I think this is just what I was looking for! Thanks.

Edit: On second thought, I don’t see how this helps me. It doesn’t seem to be able to convert the AlignedSequence to a Sequence. In that case I might as well just use the struct I created above.

I think the best data structure highly depends on the operations you want to perform. AlignedSequence would be possible to represent such mutated sequences, but it is not clear whether it meets your needs because we don’t know what you really want to do on them.

1 Like

I usually just trust the compiler on this, its heuristics are quite good.

I am not sure NTuple is a good type for a long sequence though. Just use a vector. Or leave it as a type parameter so it works with any kind of <:AbstractVector.

1 Like

I want to be able to access the sequence letters via getindex, but especially iterate through them as fast as possible.

You should be able to convert between the two, by iterating over every element of the aligned sequence and use that to build a new sequence. Something like DNASequence(collect(my_aligned_sequence))