This looks like a problem which would use SIMD for an optimal solution. Perhaps the only way to get Julia to use SIMD effectively in this problem is hand-crafted code.
1 Like
Trying to understand how some of the internals of the decode and encode functions work, I discovered the ādataā property of DNA sequences.
Using this property, execution times can be shortened.
julia> using BioSequences, BenchmarkTools,StatsBase
julia> rseq = randdnaseq(10^6)
1000000nt DNA Sequence:
AACATCACCCGTGCCAATTACACGTCGTCCTTATTATAAā¦GAAAGCCATGCAGAAATTTTACACAGATCAAACTTCTCT
julia> function countdin(rseq)
h1=countmap(reinterpret(reshape,UInt8,rseq.data))
lseq=LongSequence{DNAAlphabet{4}}(undef, rseq.len-2)
copyto!(lseq, 1, rseq, 2, rseq.len-2)
h2=countmap(reinterpret(reshape,UInt8,lseq.data))
fdict=mergewith(+,h1,h2)
res=Dict{Tuple{DNA,DNA},Int}()
for x in keys(fdict)
c=x==0 ? continue : reinterpret.(DNA, ((x<<4)>>4, x>>4))
res[c]=fdict[x]
end
res
end
countdin (generic function with 1 method)
julia> @btime countdin(rseq)
986.100 Ī¼s (30 allocations: 497.28 KiB)
Dict{Tuple{DNA, DNA}, Int64} with 16 entries:
(DNA_G, DNA_T) => 62503
(DNA_T, DNA_A) => 62252
(DNA_G, DNA_C) => 62576
(DNA_C, DNA_T) => 62250
(DNA_A, DNA_A) => 62555
(DNA_C, DNA_C) => 62272
(DNA_G, DNA_G) => 62520
(DNA_C, DNA_G) => 62609
(DNA_T, DNA_T) => 62469
(DNA_T, DNA_C) => 62567
(DNA_A, DNA_T) => 62622
(DNA_A, DNA_C) => 62414
(DNA_T, DNA_G) => 62555
ā® => ā®