Optimizing dinucleotides count in a DNA sequence type `LongDNA`

camilogarciabotero · July 13, 2023, 4:51pm

I am working with the BioSequences.jl package, particularly a LongSequence{DNAAlphabet{4}} (a very efficient string) type like the following random DNA sequence seq = dna"ACACATCATCTACTATCAT" and I want a function that counts the instances of dinucleotides so that I get the following output:

dinucleotides(seq)

Dict{LongSequence{DNAAlphabet{4}}, Int64} with 16 entries:
  GG => 0
  AC => 3
  GC => 0
  CG => 0
  CC => 0
  TG => 0
  TC => 3
  AT => 4
  GT => 0
  AA => 0
  GA => 0
  CT => 2
  CA => 4
  TT => 0
  TA => 2
  AG => 0

I have implemented a method that in BioMarkovChains.jl.

function dinucleotides(sequence::LongNucOrView{4}; extended_alphabet::Bool = false)
    A = extended_alphabet ? collect(alphabet(DNA)) : [DNA_A, DNA_C, DNA_G, DNA_T]
    dinucleotides = vec([LongSequence{DNAAlphabet{4}}([n1, n2]) for n1 in A, n2 in A])

    # counts = zeros(Int64, length(dinucleotides))
    counts = Array{Int64,1}(undef, 64)
    for (index, pair) in enumerate(dinucleotides)
        count = 0
        for i in 1:length(sequence)-1
            if sequence[i] == pair[1] && sequence[i+1] == pair[2]
                count += 1
            end
        end
        counts[index] = count
    end

    pairsdict = Dict{LongSequence{DNAAlphabet{4}},Int64}()
    for (index, pair) in enumerate(dinucleotides)
        pairsdict[pair] = counts[index]
    end

    return pairsdict
end

This method is moderately fast, but allocates a lot when the sequence begin to grow.

There is an efficient method count for Strings that will count patterns (of different lengths in a larger sequence the official example is:

count('a', "JuliaLang")

2

There is also a nice implementation to count individual elements of different structs in the StatsBase std library that is very efficient for any seq:

countmap(seq)

Dict{DNA, Int64} with 3 entries:
  DNA_C => 6
  DNA_T => 6
  DNA_A => 7

I know this is actually k-mer problem, but wondering if it could be solved for the dinucleotide 2-mer case simply?

gdalle · July 13, 2023, 6:13pm

Have you tried countmap(zip(seq[begin:end-1], seq[begin+1:end])), or the same with count? I don’t know anything about BioJulia but that’s the simplest thing I could think of.

camilogarciabotero · July 13, 2023, 6:25pm

It actually works (with a slighly different, but managable output), yet it allocates a little bit more than dinucleotides:

rseq = randdnaseq(10^6)


@btime dinucelotides(rseq) 
55.313 ms (40 allocations: 3.45 KiB)

Dict{LongSequence{DNAAlphabet{4}}, Int64} with 16 entries:
  GG => 62176
   .
   .
   .

@btime
countmap(zip(rseq[begin:end-1], rseq[begin+1:end]))
  11.391 ms (19 allocations: 978.20 KiB)

Dict{Tuple{DNA, DNA}, Int64} with 16 entries:
  (DNA_G, DNA_A) => 62617
        .
        .
        .

gdalle · July 13, 2023, 6:29pm

It allocates a lot because it does two copies of the list when taking slices. I wrote it that way for simplicity, but a better way to write it would be the following:

a = @view seq[begin:end-1]
b = @view seq[begin+1:end]
countmap(zip(a, b))

This should not allocate as much.

gdalle · July 13, 2023, 6:29pm

Also, when you use BenchmarkTools.jl, don’t forget to put a $ in front of any global variables passed to your functions, for unbiased benchmarks

camilogarciabotero · July 13, 2023, 9:47pm

Nice! This is a very fast alternative and allocates just ~1 Kb!

gdalle · July 14, 2023, 6:34am

Yeah you can’t escape the allocation of the count dictionary itself

rocco_sprmnt21 · July 14, 2023, 6:59am


using BioSequences, StatsBase, IterTools, BenchmarkTools

rseq = randdnaseq(10^6);

@btime countmap(partition(rseq,2,1)); 
# 286.886 ms (6000004 allocations: 198.37 MiB)

@btime countmap(zip(rseq[begin:end-1], rseq[begin+1:end]));
 #  15.481 ms (19 allocations: 978.20 KiB)


function zipdna(rseq)
    D=Dict(((l,r) for l in dna"ACGT", r in dna"ACGT").=>0)
    for (l,r) in  zip(view(rseq,1:10^6-1), view(rseq,2:10^6))
        D[(l,r)]+=1 
    end
    D
end

@btime zipdna(rseq)
# 23.802 ms (11 allocations: 1.81 KiB)

function getdna(rseq)
dd=Dict{Tuple{DNA, DNA}, Int64}()
for  i in 1:10^6-1 
    (l,r)= (rseq[i],rseq[i+1])
    #dd[(l,r)]=get(dd, (l,r), 0) + 1
    dd[(l,r)]=get!(()->0,dd,(l,r))+1
end
dd
end

@btime getdna(rseq); 
#  24.949 ms (7 allocations: 1.28 KiB)

@btime begin
    a = @view rseq[begin:end-1]
    b = @view rseq[begin+1:end]
    countmap(zip(a, b))
end 
# 14.269 ms (15 allocations: 1.52 KiB)

Some questions about the measurements obtained with the previous functions.
Why does the function with partition make so many allocations?
I tried to simulate in two slightly different ways what countmap does.
Again what can be “improved” to (at least) equalize the countmap result?
Does anyone know if, internally, it uses a particular algorithm that is more performant?

given this

julia> @btime partition($rseq,2,1);
  1.500 ns (0 allocations: 0 bytes)

julia> @btime zip((@view $rseq[begin:end-1]), @view $rseq[begin+1:end]);
  4.700 ns (0 allocations: 0 bytes)

why partition has such a worse result than zip()?

function zipdna(rseq)
    D=Dict(((l,r) for l in dna"ACGT", r in dna"ACGT").=>0)
    for (l,r) in  zip(view(rseq,1:10^6-1), view(rseq,2:10^6))
        D[(l,r)]+=1 
    end
    D
end

@btime zipdna(rseq)
# 23.802 ms (11 allocations: 1.81 KiB)

function getpartdna(rseq)
    D=Dict(((l,r) for l in dna"ACGT", r in dna"ACGT").=>0)
    for  (l,r) in partition(rseq,2,1)
        D[(l,r)]+=1 
    end
    D
end
@btime getpartdna($rseq)
# 252.704 ms (6000006 allocations: 198.37 MiB)

rocco_sprmnt21 · July 14, 2023, 9:44am

squeezing a little more…

julia> @btime begin
           #a = @view rseq[begin:end-1]
           b = @view rseq[begin+1:end]
           countmap(zip(rseq, b))
       end
  10.514 ms (11 allocations: 1.41 KiB)

gdalle · July 14, 2023, 10:03am

Right, forgot that zip stops at the shortest of the two

rocco_sprmnt21 · July 14, 2023, 1:48pm

updating IterTools to the master you get the following result


julia> @btime countmap(zip(rseq,@view rseq[begin+1:end]));
  10.757 ms (11 allocations: 1.41 KiB)

julia> @btime countmap(partition(rseq,2,1));
  9.726 ms (9 allocations: 1.33 KiB)

if a tighter (5X) time was needed, something like this could be done


julia> rseq=rand(codeunits("ACGT"),10^6)
1000000-element Vector{UInt8}:
 0x41
 0x41
 0x41
 0x47
 0x54
 0x54
 0x43
 0x41
 0x41
    ⋮
 0x41
 0x54
 0x43
 0x47
 0x41
 0x41
 0x54
 0x47

julia>  @btime countmap(reinterpret(UInt16,@view repeat($rseq,inner=2)[2:end-1]))
  1.731 ms (11 allocations: 2.41 MiB)
Dict{UInt16, Int64} with 16 entries:
  0x4341 => 62529
  0x5441 => 62362
  0x4754 => 62289
  0x4743 => 62460
  0x5447 => 62838
  0x4354 => 62691
  0x4143 => 62459
  0x4747 => 62239
  0x4343 => 62486
  0x4347 => 62278
  0x5454 => 62697
  0x4141 => 62259
  0x4154 => 62799
  0x5443 => 62579
  0x4147 => 62333
  0x4741 => 62701

julia> @btime join.(partition(Char.(reinterpret(UInt8,Base.vect(keys(cm)...))),2,2)).=>values(cm)
  3.950 μs (112 allocations: 4.73 KiB)
16-element Vector{Pair{String, Int64}}:
 "AC" => 62529
 "AT" => 62362
 "TG" => 62289
 "CG" => 62460
 "GT" => 62838
 "TC" => 62691
 "CA" => 62459
 "GG" => 62239
 "CC" => 62486
 "GC" => 62278
 "TT" => 62697
 "AA" => 62259
 "TA" => 62799
 "CT" => 62579
 "GA" => 62333
 "AG" => 62701

barucden · July 14, 2023, 3:47pm

I haven’t timed it but what about this simple for loop?

function dinucleotides(seq::LongSequence)
    A = Alphabet(seq)

    n_letters = UInt64(length(A))
    table = zeros(n_letters, n_letters)

    prev = BioSequences.encode(A, first(seq))
    for i in (firstindex(seq) + 1):lastindex(seq)
        current = BioSequences.encode(A, seq[i])
        table[prev + 1, current + 1] += 1
        prev = current
    end

    result = Dict{Tuple{eltype(A), eltype(A)}, Int64}()
    for c in 1:n_letters
        for r in 1:n_letters
            count = table[r, c]
            if count > 0
                first_letter = BioSequences.decode(A, r - 1)
                second_letter = BioSequences.decode(A, c - 1)
                result[(first_letter, second_letter)] = count
            end
        end
    end

    return result
end

rocco_sprmnt21 · July 15, 2023, 8:38am

rseq=rand(codeunits("ACGT"),10^6)

julia> function cur3(rseq)
           cm = countmap([reinterpret(UInt16, rseq) ;reinterpret(UInt16,(@view rseq[2:end-1]))])
           Dict((join(Char.(reinterpret(UInt8,[k]))) for k in keys(cm)).=>values(cm))
       end
cur3 (generic function with 1 method)

julia> @btime cur3($rseq)
  858.100 μs (105 allocations: 2.42 MiB)
Dict{String, Int64} with 16 entries:
  "AT" => 62415
  "CA" => 62286
  "TG" => 62896
  "CC" => 62492
  "TA" => 62277
  "GT" => 62284
  "GA" => 62936
  "TT" => 62174
  "AC" => 62541
  "CT" => 62528
  "GC" => 62950
  ⋮    => ⋮

rocco_sprmnt21 · July 15, 2023, 4:40pm

could you explain the technique used for the encode and decode functions?
if possible the implementation details.
How is it different from using a simple dictionary: A=>1; C=>2 etc?

Why do you use the values 1, 2, 4 and 8 (powers of 2) instead of 1,2,3,4 (you may have a 4X4 matrix instead of 16X16)?
if the problem was related (I don’t know if they are significant in the real case) to trinucleotides or n-nucleotides in general, would this grouping technique still work?

camilogarciabotero · July 15, 2023, 6:15pm

For the time being I’ve found the @gdalle suggestion the most appealing. The idea to count bytes sounds attractive, though.

I was also trying to use the count function but seems to be inefficient when doing recursively to store all possible dinucleotides values:

const DINUCLEOTIDES = let
    x = [LongSequence{DNAAlphabet{2}}([i, j]) for i in ACGT, j in ACGT]
    # x = [string(i, j) for i in ACGT, j in ACGT]
    Dict(x .=> 0)
end

function counter(sequence::LongDNA{4})
    dn = DINUCLEOTIDES
    s = string(@view sequence[begin:end])
    for (key, value) in dn
        dn[key] = count(string(key), s, overlap=true)
    end
    return dn
end

I was also thinking if there is any way to perhaps extend the count method to the LongDNA{4} types from BioSequences.jl, but couldn’t make it efficient… Here is a test for counter

rseq = randdnaseq(10^6)

@btime counter($rseq);
  79.553 ms (239 allocations: 1.23 MiB)

barucden · July 15, 2023, 7:41pm

Using the encode and decode functions is more generic as it supports any alphabet from BioSequences (not just 2-bit DNA sequences).

The alphabet of the random sequence (as returned by randdnaseq) allows for more symbols than just A, C, T, G (see https://github.com/BioJulia/BioSequences.jl/blob/1ccb6028c92b47655d7b0aea175c0d3a83f494bd/src/alphabet.jl#L153). The specific numeric values are determined by BioSequences, and although I am sure there is a reason for that, I don’t know why they decided to go with this encoding.

Well, for n-nucleotides, one needs to remember last n-1 nucleotides. It could be done, e.g., using some circular buffer. And of course the table that captures the counts would have to be n-dimensional.

barucden · July 15, 2023, 7:52pm

This function is not a good candidate.

It cannot be called multiple times since it’s using the global dictionary for keeping the counts.
s = string(@view sequence[begin:end]) can be written as s = string(sequence). It creates a copy of the input sequence in a sense.
The whole sequence is scanned 16 times.

@gdalle’s suggestion is probably the shortest one with a reasonable performance. What I proposed is longer but also almost 10 times faster.

rocco_sprmnt21 · July 15, 2023, 8:08pm

the function of @barucden, while remaining very general, is at least 50 times faster than counter()

julia> rseq = randdnaseq(10^6)
1000000nt DNA Sequence:
GTAACTTATGACCTTCAGTTCAATGGTCACGTATATAAT…ACTGAACGCCTACGGTGAAATTAGGCAACGCAATACCCG

julia> function dinucleotides(seq::LongSequence)
           A = Alphabet(seq)

           n_letters = UInt64(length(A))
           table = zeros(n_letters, n_letters)

           prev = BioSequences.encode(A, first(seq))
           for i in (firstindex(seq) + 1):lastindex(seq)
               current = BioSequences.encode(A, seq[i])
               table[prev + 1, current + 1] += 1
               prev = current
           end
           # table
           result = Dict{Tuple{eltype(A), eltype(A)}, Int64}()
           for c in 1:n_letters
               for r in 1:n_letters
                   count = table[r, c]
                   if count > 0
                       first_letter = BioSequences.decode(A, r - 1)
                       second_letter = BioSequences.decode(A, c - 1)
                       result[(first_letter, second_letter)] = count
                   end
               end
           end

           return result
       end
dinucleotides (generic function with 1 method)


julia> @btime dinucleotides($rseq)
  1.125 ms (8 allocations: 3.41 KiB)
Dict{Tuple{DNA, DNA}, Int64} with 16 entries:
  (DNA_G, DNA_T) => 62755
  (DNA_T, DNA_A) => 62238
  (DNA_G, DNA_C) => 62691
  (DNA_C, DNA_T) => 62447
  (DNA_A, DNA_A) => 62399
  ⋮              => ⋮

rocco_sprmnt21 · July 15, 2023, 8:14pm

it may be for the convenience of using expressions that manipulate bits using shifts instead of multiplications

rocco_sprmnt21 · July 15, 2023, 8:26pm

I found a function that maps the “ACGT” codes to the range 1:4, but despite the less generality of the solution, the execution times are longer.
This evidently depends on the correct definition of suitable support structures.
Which?


cod(x::UInt8)=rem(x*0x3,0x05)+0x01

julia> rseq = randdnaseq(10^6)
1000000nt DNA Sequence:
CTTACCGGCGTACTGACTCGCTTGTCGTTCAATTAGGAG…CTATAGTATGAAATCGCTCGGGTCTCGCGATCTGTTCCT
julia> function dinuc(rseq)
           seq=cod.(codeunits(string(rseq)))
           table = zeros(Int,0x4, 0x4)
           for i in (firstindex(seq) + 1):lastindex(seq)
               table[seq[i-1] , seq[i] ] += 1
           end
           result = Dict{Tuple{DNA, DNA}, Int64}()
            l=dna"ACTG"
           foreach(((r,c),)->result[(l[r], l[c])] = table[r, c], tuple.((0x01:0x4)', 0x01:0x4))
           return result
       end
dinuc (generic function with 1 method)

julia> rseq = randdnaseq(10^6)
1000000nt DNA Sequence:
TCACAAGTCCGAATTGACCGCCGCACAGCAGGACTGAGT…CGGACTAAGCCAGAATCCTCCGAAATCAATGAGAAATTG

julia> @btime dinuc($rseq)
  3.293 ms (26 allocations: 2.18 MiB)
Dict{Tuple{DNA, DNA}, Int64} with 16 entries:
  (DNA_G, DNA_T) => 62467
  (DNA_T, DNA_A) => 62947
  (DNA_G, DNA_C) => 62291
  (DNA_C, DNA_T) => 62602
  (DNA_A, DNA_A) => 62781
  (DNA_C, DNA_C) => 62501
  (DNA_G, DNA_G) => 62399
  (DNA_C, DNA_G) => 62433
  (DNA_T, DNA_T) => 62622
  (DNA_T, DNA_C) => 62464
  ⋮              => ⋮

As proof of the fact, if I take the transformation (which requires the conversion of the dna sequence into a string) out of the function, the times are shorter

seq=cod.(codeunits(string(rseq)))

function dinuc(seq)
    #seq=cod.(codeunits(string(rseq)))
    table = zeros(Int,0x4, 0x4)
    for i in (firstindex(seq) + 1):lastindex(seq)
        table[seq[i-1] , seq[i] ] += 1
    end
    result = Dict{Tuple{DNA, DNA}, Int64}()
     l=dna"ACTG"
    foreach(((r,c),)->result[(l[r], l[c])] = table[r, c], tuple.((0x01:0x4)', 0x01:0x4))
    return result
end

julia> @btime dinuc($seq)
  794.000 μs (9 allocations: 1.55 KiB)
Dict{Tuple{DNA, DNA}, Int64} with 16 entries:
  (DNA_G, DNA_T) => 62467
  ⋮              => ⋮