I am working with the BioSequences.jl
package, particularly a LongSequence{DNAAlphabet{4}}
(a very efficient string) type like the following random DNA sequence seq = dna"ACACATCATCTACTATCAT"
and I want a function that counts the instances of dinucleotides so that I get the following output:
dinucleotides(seq)
Dict{LongSequence{DNAAlphabet{4}}, Int64} with 16 entries:
GG => 0
AC => 3
GC => 0
CG => 0
CC => 0
TG => 0
TC => 3
AT => 4
GT => 0
AA => 0
GA => 0
CT => 2
CA => 4
TT => 0
TA => 2
AG => 0
I have implemented a method that in BioMarkovChains.jl.
function dinucleotides(sequence::LongNucOrView{4}; extended_alphabet::Bool = false)
A = extended_alphabet ? collect(alphabet(DNA)) : [DNA_A, DNA_C, DNA_G, DNA_T]
dinucleotides = vec([LongSequence{DNAAlphabet{4}}([n1, n2]) for n1 in A, n2 in A])
# counts = zeros(Int64, length(dinucleotides))
counts = Array{Int64,1}(undef, 64)
for (index, pair) in enumerate(dinucleotides)
count = 0
for i in 1:length(sequence)-1
if sequence[i] == pair[1] && sequence[i+1] == pair[2]
count += 1
end
end
counts[index] = count
end
pairsdict = Dict{LongSequence{DNAAlphabet{4}},Int64}()
for (index, pair) in enumerate(dinucleotides)
pairsdict[pair] = counts[index]
end
return pairsdict
end
This method is moderately fast, but allocates a lot when the sequence begin to grow.
There is an efficient method count
for Strings
that will count patterns (of different lengths in a larger sequence the official example is:
count('a', "JuliaLang")
2
There is also a nice implementation to count individual elements of different structs in the StatsBase
std library that is very efficient for any seq:
countmap(seq)
Dict{DNA, Int64} with 3 entries:
DNA_C => 6
DNA_T => 6
DNA_A => 7
I know this is actually k-mer
problem, but wondering if it could be solved for the dinucleotide 2-mer
case simply?