Optimizing dinucleotides count in a DNA sequence type `LongDNA`

Dan · July 15, 2023, 9:00pm

This looks like a problem where an optimal solution would use SIMD. Hand-crafted code may be the only way to get Julia to use SIMD effectively here.

rocco_sprmnt21 · July 17, 2023, 5:33pm

While trying to understand how the internals of the encode and decode functions work, I discovered the `data` field of DNA sequences.
Working directly on this field shortens execution time:
julia> using BioSequences, BenchmarkTools, StatsBase

julia> rseq = randdnaseq(10^6)
1000000nt DNA Sequence:
AACATCACCCGTGCCAATTACACGTCGTCCTTATTATAA…GAAAGCCATGCAGAAATTTTACACAGATCAAACTTCTCT

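julia> function countdin(rseq)
    # Each byte of the packed 4-bit data holds two adjacent nucleotides,
    # so counting bytes counts the pairs (1,2), (3,4), ...
    # Note: this assumes an even-length sequence.
    h1=countmap(reinterpret(reshape,UInt8,rseq.data))
    # Copy the sequence shifted by one position to count the pairs (2,3), (4,5), ...
    lseq=LongSequence{DNAAlphabet{4}}(undef, rseq.len-2)
    copyto!(lseq, 1, rseq, 2, rseq.len-2)
    h2=countmap(reinterpret(reshape,UInt8,lseq.data))
    fdict=mergewith(+,h1,h2)
    res=Dict{Tuple{DNA,DNA},Int}()
    for x in keys(fdict)
        # Skip empty bytes; otherwise split the byte into (low nibble, high nibble).
        c=x==0 ? continue : reinterpret.(DNA, ((x<<4)>>4, x>>4))
        res[c]=fdict[x]
    end
    res
end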
countdin (generic function with 1 method)

julia> @btime countdin(rseq)
  986.100 μs (30 allocations: 497.28 KiB)
Dict{Tuple{DNA, DNA}, Int64} with 16 entries:
  (DNA_G, DNA_T) => 62503
  (DNA_T, DNA_A) => 62252
  (DNA_G, DNA_C) => 62576
  (DNA_C, DNA_T) => 62250
  (DNA_A, DNA_A) => 62555
  (DNA_C, DNA_C) => 62272
  (DNA_G, DNA_G) => 62520
  (DNA_C, DNA_G) => 62609
  (DNA_T, DNA_T) => 62469
  (DNA_T, DNA_C) => 62567
  (DNA_A, DNA_T) => 62622
  (DNA_A, DNA_C) => 62414
  (DNA_T, DNA_G) => 62555
  ⋮              => ⋮
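
For anyone wondering why counting bytes counts dinucleotides, here is a minimal sketch in plain Julia with no BioSequences dependency. It assumes a packing like the one the 4-bit alphabet appears to use (4 bits per nucleotide, first nucleotide in the lowest nibble, 16 per `UInt64`, with A=1, C=2, G=4, T=8) and a little-endian machine. The names `pack4`, `count_aligned_pairs!`, `count_dinucleotides` and `count_naive` are hypothetical helpers made up for this illustration, not part of any package.

```julia
# Hypothetical helpers for illustration only; not part of BioSequences.

# Pack 4-bit codes into UInt64 words, first code in the lowest nibble.
function pack4(codes::Vector{UInt8})
    words = zeros(UInt64, cld(length(codes), 16))
    for (i, c) in enumerate(codes)
        w, off = divrem(i - 1, 16)
        words[w + 1] |= UInt64(c & 0x0f) << (4 * off)
    end
    return words
end

# Count the pairs starting at odd positions: (1,2), (3,4), ...
# Each byte of the packed data holds exactly one such pair.
# Assumes a little-endian byte order when viewing the words as bytes.
function count_aligned_pairs!(counts::Matrix{Int}, words::Vector{UInt64}, n::Int)
    bytes = reinterpret(UInt8, words)
    for k in 1:(n ÷ 2)          # visit only bytes that hold two real codes
        b = bytes[k]
        counts[(b & 0x0f) + 1, (b >> 4) + 1] += 1
    end
    return counts
end

# All n-1 overlapping pairs: one pass over the data as is, and one pass
# over the data shifted by one position.
function count_dinucleotides(codes::Vector{UInt8})
    counts = zeros(Int, 16, 16)   # counts[first + 1, second + 1]
    count_aligned_pairs!(counts, pack4(codes), length(codes))
    shifted = codes[2:end]
    count_aligned_pairs!(counts, pack4(shifted), length(shifted))
    return counts
end

# Straightforward reference implementation for cross-checking.
function count_naive(codes::Vector{UInt8})
    counts = zeros(Int, 16, 16)
    for i in 1:length(codes) - 1
        counts[codes[i] + 1, codes[i + 1] + 1] += 1
    end
    return counts
end

codes = UInt8[1, 2, 4, 8, 1, 1, 2]   # A C G T A A C under the assumed encoding
counts = count_dinucleotides(codes)
```

Unlike the REPL version, this sketch only visits bytes that hold two real codes, so odd lengths and padding bytes in the last word do not add spurious pairs. It is meant to show the idea, not to be fast.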
