How to improve a Generator to be more memory efficient when it is collected?

Hi all,

I wrote a function that searches through a sequence (a BioSequences.LongDNA but could also be a String) for start codons (predefined as constants subsequence searches) and generates ranges of indices corresponding to the start of start codons of the sequence. Later, when two conditions are met (no “premature” stops and length), then it generates stop range. It is fast when defined, but very slow when collected and allocates a lot of memory for a small sequence:

const STOPCODONS = [dna"TAG", dna"TAA"), dna"TGA"]
const STARTCODON = ExactSearchQuery(dna"ATG", iscompatible)
const EXTENDED_STARTCODONS = ExactSearchQuery(dna"DTG", iscompatible)

function eachcodon(sequence::LongDNA)
    seqbound = length(sequence) - 2
    return(sequence[i:i+2] for i in 1:3:seqbound)
end

function hasprematurestop(sequence::LongDNA)::Bool
    stop_codon_count = 0
    @inbounds for codon in eachcodon(sequence)
        if codon ∈ STOPCODONS
            stop_codon_count += 1
        end
    end
    stop_codon_count > 1
end

function locationgenerator(sequence::LongDNA; alternative_start::Bool=false, min_len::Int64=6)
    seqbound = length(sequence) - 2
    startcodons = alternative_start ? EXTENDED_STARTCODONS : STARTCODON
    start_codon_indices = findall(startcodons, sequence)
    @inbounds begin
        location = (i.start:j+2 for i in start_codon_indices for j in i.start:3:seqbound if sequence[j:j+2] ∈ STOPCODONS && !hasprematurestop(sequence[i.start:j+2]) && length(i.start:j+2) >= min_len)
    end
    return location
end

Here a small benchmark:

using BioSequences
using BenchmarkTools
seq = randdnaseq(1*10^4) # only 10Kb string

@benchmark collect(locationgenerator(seq))

BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.197 s …   1.250 s  ┊ GC (min … max): 5.80% … 6.28%
 Time  (median):     1.222 s              ┊ GC (median):    6.42%
 Time  (mean ± σ):   1.220 s ± 21.299 ms  ┊ GC (mean ± σ):  6.24% ± 0.33%

  █      █                  █      █                      █  
  █▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.2 s          Histogram: frequency by time        1.25 s <

 Memory estimate: 1.29 GiB, allocs estimate: 28430466.

Any suggestions on how to improve this?

Try adding @views here to avoid allocations of slices in that line:

Because things like this:

Allocate intermediate arrays.

(Although since this is a string, you may need something like StringViews.jl here)

Thanks,

I did try @views, and nothing changed though. It seems to still allocate the same memory.

How come? I thought that the BioSequences package could deal with these particular strings, but I’ll check StringViews.jl!

I’ve been playing with the string version of the sequence, I mean the problem is to find the sequences starting with the codon “ATG” and ending with a codon of these e = [“TAG”, “TAA”, “TGA”].
The execution time is 5000X less than the time it takes to collect(locationgenerator(seq)).
The result is the same only for the left end of the intervals.
Evidently I don’t quite understand what the conditions of the problem are.
But given the greater speed, it would be appropriate to follow a similar path, fixing what needs to be fixed.

ulia> seq = randdnaseq(1*10^4)
10000nt DNA Sequence:
AGGACGGGATGTGGAGATACATTTCCGCCCAACACGTCC…AATCGCCGGCGCTAGGTCTCTATGACGCGTTTGTGTAAG

julia> @btime collect(locationgenerator(seq));
  819.104 ms (26915140 allocations: 1.22 GiB)

julia> sseq=string(seq)
"AGGACGGGATGTGGAGATACATTTCCGCCCAACACGTCCATACGGCTTAACAACGTCTCCACTTCAGCGAGGGCCCATTCTGATTGGGCAATAAACTCTATGTCTTTTATGATGCATTTCAATCAGCTCTGACCGTTATCCCCGTTTTAGTACATACGGCCGATCTTGAGCGTGTATGTACGGACACTCGCGGGACATATTTGGCGAGGGGGTTTTCCACTGCAAATGGCACGAAGCACTCTCGCGTGCGCTTCTC" ⋯ 9488 bytes ⋯ "CCACGGCGTCGACAGCTCTATTGCATACCGCCTAGATAGCCCTCCACCATTTGGCCTAATCAAACGGGCTATACCCATCTGATACACTCTTAGCAGAGTAATGGGCCAGCGCCGGGATTTATGGAAGTTCCTTCGCGGGCCCCGTGGACGCCTGGGCGGCACGTTTGGCTGCGACGTATACCGGAAAATACTGTCACAGCTCGTGCTACAAGTTCTGAATCGCCGGCGCTAGGTCTCTATGACGCGTTTGTGTAAG"

julia> e = ["TAG", "TAA", "TGA"]
3-element Vector{String}:
 "TAG"
 "TAA"
 "TGA"

julia> b = "ATG"
"ATG"

julia> function starts(sseq,b)
           B=findfirst(b,sseq)
           eos=true
           v=Int[]
           while eos
               push!(v,first(B))
               B=findnext(b,sseq,last(B)+1)
               eos=!isnothing(B)
           end
           v
       end
starts (generic function with 1 method)

julia> function ends(sseq,st)
           B=st[1]:st[1]
           u=UnitRange[]
           for i in eachindex(st[1:end-1])
               p=findlast.(e,sseq[st[i]:st[i+1]])
               nnp=findfirst(!isnothing,p)
               if !isnothing(nnp)
                   push!(u,(:)(st[i],last(p[nnp])+st[i]-1))
               end
           end
           u
       end
ends (generic function with 1 method)

julia> function startend(sseq,b,e)
           st=starts(sseq,b)
           ends(sseq,st)
       end
startend (generic function with 1 method)

julia> @btime startend(sseq,b,e)
  142.300 μs (2758 allocations: 141.77 KiB)
125-element Vector{UnitRange}:
 9:94
 109:112
 112:150
 226:259
 303:315
 346:398
 413:445
 454:571
 739:765
 788:846
 892:906
 908:954
 984:999
 1004:1052
 1066:1119
 1121:1211
 1277:1326
 1341:1378
 1384:1477
 1548:1564
 ⋮
 7934:7981
 8021:8377
 8408:8419
 8488:8496
 8524:8591
 8604:8620
 8625:8630
 8680:8761
 8804:8819
 8821:8862
 8891:8930
 8940:8950
 8978:9074
 9187:9252
 9266:9269
 9302:9370
 9405:9434
 9466:9656
 9687:9837
 9865:9976

julia> rc=collect(locationgenerator(seq))
149-element Vector{UnitRange{Int64}}:
 9:50
 100:132
 109:132
 112:132
 176:259
 226:315
 303:371
 346:405
 413:430
 454:699
 739:753
 788:1036
 850:906
 892:906
 908:1036
 984:1004
 1004:1036
 1066:1089
 1121:1150
 1277:1354
 ⋮
 8524:8607
 8604:8630
 8625:8630
 8680:8844
 8798:8815
 8804:8815
 8821:8844
 8891:8905
 8934:9005
 8940:9005
 8978:9139
 9187:9252
 9266:9370
 9302:9370
 9405:9434
 9447:9581
 9466:9492
 9687:9779
 9845:9976
 9865:9999

Or this more compact version that searches for all [st_codon … end_codon] sequences where end_codon is the first codon in the terminal codon list after the current st_codon.

My algorithm in the following chunk of sequence takes up to the first TAA(sseq[810:862]) while the generator takes up to the second

julia> sseq[810:866]
"ATGCTCTCCATCACCCGTCAGTCTGTCCCGCGGTGCTTCGAACCACTCCATAATTAA"

I adapted the algorithm (at least for a part of the conditions that I guessed) and the results coincide (in many tests done; several random generations of strings of 10^4 nitrogenous bases )

The ratio between the times is always above 5000X

julia> e = ["TAG", "TAA", "TGA"]
3-element Vector{String}:
 "TAG"
 "TAA"
 "TGA"

julia> b = "ATG"
"ATG"

julia> function startend1(sseq,b,e)
           B=findfirst(b,sseq)
           eos=true
           v=UnitRange[]
           L=length(sseq)
           while eos
               Er=findfirst(i->sseq[i:i+2] ∈ e, last(B)+1:3:L-2)
               if !isnothing(Er)
                   E=Er*3+last(B)
                   push!(v,(:)(first(B),last(E)))
                   B=findnext(b,sseq,last(B)+1)
                   eos=!isnothing(B)
               elseif !isnothing(B)
                   B=findnext(b,sseq,last(B)+1)
                   eos=!isnothing(B)
               else
                   eos=false
               end
           end
           v
       end
startend1 (generic function with 1 method)

julia> seq = randdnaseq(1*10^4);

julia> rc=collect(locationgenerator(seq));

julia> @btime collect(locationgenerator(seq));
  691.770 ms (22726178 allocations: 1.03 GiB)

julia> sseq=string(seq);

julia> @btime startend1(sseq,b,e);
  125.900 μs (3522 allocations: 97.33 KiB)

julia> rc == startend1(sseq,b,e)
true

julia> @btime startend1(sseq,b,e);
  131.100 μs (3522 allocations: 97.33 KiB)

julia> @btime startend1(sseq,b,e);
  122.100 μs (3522 allocations: 97.33 KiB)

julia> @btime startend1(sseq,b,e);
  111.000 μs (3522 allocations: 97.33 KiB)

Tried to patch together a function using regular expressions to get the same output. It even performed well:

using IterTools
import Base.Iterators as Itr

const sseq=join(rand(('A','G','T','C'), 10^4));  # const by @rafael's suggestion

f(R) = findnext(r"ATG(?:[AGTC][AGTC][AGTC])*?(TAG|TAA|TGA)",sseq,first(R)+3)
itr = Itr.takewhile(!isnothing, Itr.drop(iterated(f, -2:-2),1))
# that's all !!

# r"ATG(?:[AGTC]{3})*?T(AG|AA|GA)" also works
# previous post function:
e = ["TAG", "TAA", "TGA"]
b = "ATG"
function startend1(sseq,b,e)
    B=findfirst(b,sseq)
    eos=true
    v=UnitRange[]
    L=length(sseq)
    while eos
        Er=findfirst(i->sseq[i:i+2] ∈ e, last(B)+1:3:L-2)
        if !isnothing(Er)
            E=Er*3+last(B)
            push!(v,(:)(first(B),last(E)))
            B=findnext(b,sseq,last(B)+1)
            eos=!isnothing(B)
        elseif !isnothing(B)
            B=findnext(b,sseq,last(B)+1)
            eos=!isnothing(B)
        else
            eos=false
        end
    end
    v
end

And now for a bit of benchmarking:

julia> using BenchmarkTools

julia> length([R for R in itr]) == length(startend1(sseq, b, e))
true

julia> @btime [R for R in itr];
  59.781 μs (325 allocations: 11.39 KiB)

julia> @btime startend1(sseq,b,e);
  180.610 μs (4022 allocations: 109.98 KiB)

Hope all this iterator wrapping is liked, as it is not everyone’s cup of tea.

2 Likes

wow!
could you explain, step by step, the regex part?

ATG at start looks for START codon, then (?: )*? repeats whatever is inside 0 or more times, but lazily (that’s the ? after *). This skips codons until the first (b/c lazy) of the next group. The next group is 3 string options covering the end codons. The [AGTC] repeating three times matches a codon, as each square bracket matches a single letter from inside it.

So actually, it could be even simpler:
r"ATG(?:[AGTC]{3})*?T(AG|AA|GA)"

This link gives all the common and uncommon regular expression syntax:
https://www.pcre.org/current/doc/html/pcre2syntax.html

2 Likes

For better performance, shouldn’t sseq be passed as an argument to the function f(), or at least define it as const sseq ?

1 Like

Yep (with const sseq = ...):

julia> @btime [R for R in itr];
  55.855 μs (16 allocations: 4.11 KiB)

vs.

julia> @btime [R for R in itr];
  59.781 μs (325 allocations: 11.39 KiB)

earlier.

1 Like

regex + loop

julia> f(R, sseq) = findnext(r"ATG(?:[AGTC][AGTC][AGTC])*?(TAG|TAA|TGA)",sseq,first(R)+3)
f (generic function with 2 methods)

julia> itr = Itr.takewhile(!isnothing, Itr.drop(iterated(f, -2:-2),1))
Base.Iterators.TakeWhile{Base.Iterators.Drop{IterTools.Iterated{UnitRange{Int64}, typeof(f)}}, Base.var"#97#98"{typeof(isnothing)}}(Base.var"#97#98"{typeof(isnothing)}(isnothing), Base.Iterators.Drop{IterTools.Iterated{UnitRange{Int64}, typeof(f)}}(IterTools.Iterated{UnitRange{Int64}, typeof(f)}(f, -2:-2), 1))

julia> @btime [R for R in itr];
  50.600 μs (339 allocations: 11.77 KiB)

julia> function se(sseq)
           eos=true
           R=-2:0
           v=UnitRange[]
           while eos
               R=f(R,sseq)
               !isnothing(R) && push!(v,R)
               eos = !isnothing(R)
           end
           v
       end
se (generic function with 1 method)

julia> @btime se(sseq);
  44.600 μs (173 allocations: 7.19 KiB)

julia> se(sseq)==[R for R in itr]
1 Like

This is so cool!

Can you explain the iterator, please?

itr = Itr.takewhile(!isnothing, Itr.drop(iterated(f, -2:-2),1))

The main work is done by iterated which is IterTools.iterated and returns each iteration a function applied to the value of the last iteration.
f is made to return the next match after the last match which is given as a parameter. This provides the series of matches.
The edges of the iterator need to be fixed a bit, and takewhile makes sure the iteration stops at the first nothing result without returning it.
The drop(.., 1) gets rid of the initial condition used to start searching for the first match.
That’s about it.

1 Like

Nice, why is the range -2:-2 inside iterated?

is the initial condition. The logic skips to next codon, which is 3 letters ahead, going from -2 to +1, which is start of string.

2 Likes

Maybe this could be optimized a bit to:

(eos = !isnothing(R)) && push!(v,R)
1 Like

I had tried that, but it was giving me some kind of error and I didn’t delve into it.
it works?

the -2:-2 got me thinking too. In fact what matters is the first -2. The range could be anything of the type -2: Int.
To remove this intruder and also drop the the drop function, a scheme of the following type could be used which perhaps (IMO) makes this beautiful solution even more elegant.

ptrn=r"ATG(?:[AGTC][AGTC][AGTC])*?(TAG|TAA|TGA)"
f(R) = findnext(ptrn,sseq,first(R)+3)
itr = Itr.takewhile(!isnothing, iterated(f, findfirst(ptrn,sseq)))

PS
what is the purpose of the hasrpematurestop function?

3 Likes

This is getting quite simpler, thanks.

Regarding

In the first function, the Generator was adding extra STOP codons (don’t really know why), so with the extra condition of hasprematurestop that was fixed, yet it meant extra performance cost.

Thank you for all the suggestions, I’ve come to reformulate them in terms of the BioSequences types:

using BioSequences
using IterTools

const REGULAR_CDS = biore"ATG(?:[N]{3})*?T(AG|AA|GA)"dna
const EXTENDED_REGULAR_CDS = biore"DTG(?:[N]{3})*?T(AG|AA|GA)"dna

function locationiterator(sequence::LongDNA; alternative_start::Bool=false)
    regcds = alternative_start ? EXTENDED_REGULAR_CDS : REGULAR_CDS
    finder(x) = findfirst(regcds, sequence, first(x) + 1) # + 3
    itr = Iterators.takewhile(!isnothing, iterated(finder, findfirst(regcds, sequence)))
    return itr
end

Note that I also think that the finder is more exhaustive if it just adds 1 to the next search instead of 3, and you might have noticed that I’m using findfirst instead of findnext but the biore"" does not have a defined findnext method, yet the findfirst is doing the job.

And when it comes to collect a huge 100Mb seq it is fast and quite efficient!

seq = randdnaseq(10^8)

Julia> @time collect(locationiterator(seq))
  3.083502 seconds (15.62 M allocations: 813.886 MiB, 29.51% gc time)
1561744-element Vector{UnitRange{Int64}}:

What still don’t understand is that it allocates 800Mb when the provided initial seq is 100Mb…