Efficient memory representation of sequences with few mutations

e3c6 · April 24, 2019, 10:08am

I have a dataset which consists of a wild-type sequence and many (\sim10^7 or more) mutated sequences which differ from the wild-type in only a few positions. I want to represent this efficiently in memory, by storing the wild-type sequence only once and then storing only the mutations of each mutated sequence.

Does BioSequences offer some facility for this kind of representation?
Otherwise, any advice on how I might approach this?

Tamas_Papp · April 24, 2019, 11:19am

A wrapper <: AbstractVector type that stores the indexes and values for the differences, and has a Base.getindex that looks up if the index is in the changed set or not, and returns the value accordingly.

The best representation depends on details, eg a sorted vector, or a Dict of indexes.

e3c6 · April 24, 2019, 11:47am

I thought of something like this.

Change = NamedTuple{(:pos,:x), Tuple{Int,Int}}
struct DiffSeq{L,D}
	s0::NTuple{L,Int}
	dif::NTuple{D, Change}
end

And then defined the iterator interface accordingly.
Here s0 is the wild-type sequence, and it is shared across may DiffSeq instances. However this only makes sense if s0 is only allocated in memory once.

The documentation says that (...) in some cases the compiler is able to avoid allocating immutable objects entirely. In this case s0 is an NTuple which is immutable. But how can I be sure it is not being allocated many times?

Ward9250 · April 24, 2019, 1:55pm

I don’t think you can. Immutable types in Julia are defined by their value, and not by their memory address as mutable types are. If s0 was a vector and not a tuple, then that s0 vector is passed by reference (defined by it’s memory address), and so if you pass a vector to many DiffSeqs as their s0, you know all those DiffSeqs will share that very same vector. Which you can’t know AFAIK if you use a NTuple.

Ward9250 · April 24, 2019, 1:59pm

You could also check out BioAlignments.jl, which has a datastruture which represents one sequence (an aligned sequence) as a reference to some reference sequence, and a set of edit operations (match, mismatch, indel and so on).

e3c6 · April 24, 2019, 2:02pm

I think this is just what I was looking for! Thanks.

Edit: On second thought, I don’t see how this helps me. It doesn’t seem to be able to convert the AlignedSequence to a Sequence. In that case I might as well just use the struct I created above.

bicycle1885 · April 24, 2019, 2:19pm

I think the best data structure highly depends on the operations you want to perform. AlignedSequence would be possible to represent such mutated sequences, but it is not clear whether it meets your needs because we don’t know what you really want to do on them.

Tamas_Papp · April 24, 2019, 2:24pm

I usually just trust the compiler on this, its heuristics are quite good.

I am not sure NTuple is a good type for a long sequence though. Just use a vector. Or leave it as a type parameter so it works with any kind of <:AbstractVector.

e3c6 · April 24, 2019, 2:27pm

I want to be able to access the sequence letters via getindex, but especially iterate through them as fast as possible.

Ward9250 · May 1, 2019, 11:04am

You should be able to convert between the two, by iterating over every element of the aligned sequence and use that to build a new sequence. Something like DNASequence(collect(my_aligned_sequence))

Topic		Replies	Views
[BioSequences] data structure to keep mutations (delta) of a sequence Biology, Health, and Medicine	4	299	March 7, 2024
Clarification about memory management of immutable and mutable struct Internals & Design data_structures	12	5078	November 14, 2019
Struct with immutable arrays New to Julia	2	843	March 24, 2020
Performance: mutable vs immutable structs Performance performance , memory , memory-allocation	4	6690	October 4, 2017
Making an immutable array type? Internals & Design	19	6089	November 16, 2019

Efficient memory representation of sequences with few mutations

Related topics