Optimizing dinucleotides count in a DNA sequence type `LongDNA`

This looks like a problem which would use SIMD for an optimal solution. Perhaps the only way to get Julia to use SIMD effectively in this problem is hand-crafted code.

1 Like

Trying to understand how some of the internals of the decode and encode functions work, I discovered the ā€˜dataā€™ property of DNA sequences.
Using this property, execution times can be shortened.
julia> using BioSequences, BenchmarkTools,StatsBase

julia> rseq = randdnaseq(10^6)
1000000nt DNA Sequence:
AACATCACCCGTGCCAATTACACGTCGTCCTTATTATAAā€¦GAAAGCCATGCAGAAATTTTACACAGATCAAACTTCTCT

julia> function countdin(rseq)
    h1=countmap(reinterpret(reshape,UInt8,rseq.data))
    lseq=LongSequence{DNAAlphabet{4}}(undef, rseq.len-2)
    copyto!(lseq, 1, rseq, 2, rseq.len-2)
    h2=countmap(reinterpret(reshape,UInt8,lseq.data))
    fdict=mergewith(+,h1,h2)
    res=Dict{Tuple{DNA,DNA},Int}()
    for x in keys(fdict)
        c=x==0 ? continue : reinterpret.(DNA, ((x<<4)>>4, x>>4))  
        res[c]=fdict[x]
    end
    res
end
countdin (generic function with 1 method)

julia> @btime countdin(rseq)
  986.100 Ī¼s (30 allocations: 497.28 KiB)
Dict{Tuple{DNA, DNA}, Int64} with 16 entries:
  (DNA_G, DNA_T) => 62503
  (DNA_T, DNA_A) => 62252
  (DNA_G, DNA_C) => 62576
  (DNA_C, DNA_T) => 62250
  (DNA_A, DNA_A) => 62555
  (DNA_C, DNA_C) => 62272
  (DNA_G, DNA_G) => 62520
  (DNA_C, DNA_G) => 62609
  (DNA_T, DNA_T) => 62469
  (DNA_T, DNA_C) => 62567
  (DNA_A, DNA_T) => 62622
  (DNA_A, DNA_C) => 62414
  (DNA_T, DNA_G) => 62555
  ā‹®              => ā‹®