I wrote a function that searches through a sequence (a BioSequences.LongDNA but could also be a String) for start codons (predefined as constants subsequence searches) and generates ranges of indices corresponding to the start of start codons of the sequence. Later, when two conditions are met (no “premature” stops and length), then it generates stop range. It is fast when defined, but very slow when collected and allocates a lot of memory for a small sequence:
const STOPCODONS = [dna"TAG", dna"TAA"), dna"TGA"]
const STARTCODON = ExactSearchQuery(dna"ATG", iscompatible)
const EXTENDED_STARTCODONS = ExactSearchQuery(dna"DTG", iscompatible)
function eachcodon(sequence::LongDNA)
seqbound = length(sequence) - 2
return(sequence[i:i+2] for i in 1:3:seqbound)
end
function hasprematurestop(sequence::LongDNA)::Bool
stop_codon_count = 0
@inbounds for codon in eachcodon(sequence)
if codon ∈ STOPCODONS
stop_codon_count += 1
end
end
stop_codon_count > 1
end
function locationgenerator(sequence::LongDNA; alternative_start::Bool=false, min_len::Int64=6)
seqbound = length(sequence) - 2
startcodons = alternative_start ? EXTENDED_STARTCODONS : STARTCODON
start_codon_indices = findall(startcodons, sequence)
@inbounds begin
location = (i.start:j+2 for i in start_codon_indices for j in i.start:3:seqbound if sequence[j:j+2] ∈ STOPCODONS && !hasprematurestop(sequence[i.start:j+2]) && length(i.start:j+2) >= min_len)
end
return location
end
Here a small benchmark:
using BioSequences
using BenchmarkTools
seq = randdnaseq(1*10^4) # only 10Kb string
@benchmark collect(locationgenerator(seq))
BenchmarkTools.Trial: 5 samples with 1 evaluation.
Range (min … max): 1.197 s … 1.250 s ┊ GC (min … max): 5.80% … 6.28%
Time (median): 1.222 s ┊ GC (median): 6.42%
Time (mean ± σ): 1.220 s ± 21.299 ms ┊ GC (mean ± σ): 6.24% ± 0.33%
█ █ █ █ █
█▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.2 s Histogram: frequency by time 1.25 s <
Memory estimate: 1.29 GiB, allocs estimate: 28430466.
I’ve been playing with the string version of the sequence, I mean the problem is to find the sequences starting with the codon “ATG” and ending with a codon of these e = [“TAG”, “TAA”, “TGA”].
The execution time is 5000X less than the time it takes to collect(locationgenerator(seq)).
The result is the same only for the left end of the intervals.
Evidently I don’t quite understand what the conditions of the problem are.
But given the greater speed, it would be appropriate to follow a similar path, fixing what needs to be fixed.
ulia> seq = randdnaseq(1*10^4)
10000nt DNA Sequence:
AGGACGGGATGTGGAGATACATTTCCGCCCAACACGTCC…AATCGCCGGCGCTAGGTCTCTATGACGCGTTTGTGTAAG
julia> @btime collect(locationgenerator(seq));
819.104 ms (26915140 allocations: 1.22 GiB)
julia> sseq=string(seq)
"AGGACGGGATGTGGAGATACATTTCCGCCCAACACGTCCATACGGCTTAACAACGTCTCCACTTCAGCGAGGGCCCATTCTGATTGGGCAATAAACTCTATGTCTTTTATGATGCATTTCAATCAGCTCTGACCGTTATCCCCGTTTTAGTACATACGGCCGATCTTGAGCGTGTATGTACGGACACTCGCGGGACATATTTGGCGAGGGGGTTTTCCACTGCAAATGGCACGAAGCACTCTCGCGTGCGCTTCTC" ⋯ 9488 bytes ⋯ "CCACGGCGTCGACAGCTCTATTGCATACCGCCTAGATAGCCCTCCACCATTTGGCCTAATCAAACGGGCTATACCCATCTGATACACTCTTAGCAGAGTAATGGGCCAGCGCCGGGATTTATGGAAGTTCCTTCGCGGGCCCCGTGGACGCCTGGGCGGCACGTTTGGCTGCGACGTATACCGGAAAATACTGTCACAGCTCGTGCTACAAGTTCTGAATCGCCGGCGCTAGGTCTCTATGACGCGTTTGTGTAAG"
julia> e = ["TAG", "TAA", "TGA"]
3-element Vector{String}:
"TAG"
"TAA"
"TGA"
julia> b = "ATG"
"ATG"
julia> function starts(sseq,b)
B=findfirst(b,sseq)
eos=true
v=Int[]
while eos
push!(v,first(B))
B=findnext(b,sseq,last(B)+1)
eos=!isnothing(B)
end
v
end
starts (generic function with 1 method)
julia> function ends(sseq,st)
B=st[1]:st[1]
u=UnitRange[]
for i in eachindex(st[1:end-1])
p=findlast.(e,sseq[st[i]:st[i+1]])
nnp=findfirst(!isnothing,p)
if !isnothing(nnp)
push!(u,(:)(st[i],last(p[nnp])+st[i]-1))
end
end
u
end
ends (generic function with 1 method)
julia> function startend(sseq,b,e)
st=starts(sseq,b)
ends(sseq,st)
end
startend (generic function with 1 method)
julia> @btime startend(sseq,b,e)
142.300 μs (2758 allocations: 141.77 KiB)
125-element Vector{UnitRange}:
9:94
109:112
112:150
226:259
303:315
346:398
413:445
454:571
739:765
788:846
892:906
908:954
984:999
1004:1052
1066:1119
1121:1211
1277:1326
1341:1378
1384:1477
1548:1564
⋮
7934:7981
8021:8377
8408:8419
8488:8496
8524:8591
8604:8620
8625:8630
8680:8761
8804:8819
8821:8862
8891:8930
8940:8950
8978:9074
9187:9252
9266:9269
9302:9370
9405:9434
9466:9656
9687:9837
9865:9976
julia> rc=collect(locationgenerator(seq))
149-element Vector{UnitRange{Int64}}:
9:50
100:132
109:132
112:132
176:259
226:315
303:371
346:405
413:430
454:699
739:753
788:1036
850:906
892:906
908:1036
984:1004
1004:1036
1066:1089
1121:1150
1277:1354
⋮
8524:8607
8604:8630
8625:8630
8680:8844
8798:8815
8804:8815
8821:8844
8891:8905
8934:9005
8940:9005
8978:9139
9187:9252
9266:9370
9302:9370
9405:9434
9447:9581
9466:9492
9687:9779
9845:9976
9865:9999
Or this more compact version that searches for all [st_codon … end_codon] sequences where end_codon is the first codon in the terminal codon list after the current st_codon.
My algorithm in the following chunk of sequence takes up to the first TAA(sseq[810:862]) while the generator takes up to the second
I adapted the algorithm (at least for a part of the conditions that I guessed) and the results coincide (in many tests done; several random generations of strings of 10^4 nitrogenous bases )
The ratio between the times is always above 5000X
julia> e = ["TAG", "TAA", "TGA"]
3-element Vector{String}:
"TAG"
"TAA"
"TGA"
julia> b = "ATG"
"ATG"
julia> function startend1(sseq,b,e)
B=findfirst(b,sseq)
eos=true
v=UnitRange[]
L=length(sseq)
while eos
Er=findfirst(i->sseq[i:i+2] ∈ e, last(B)+1:3:L-2)
if !isnothing(Er)
E=Er*3+last(B)
push!(v,(:)(first(B),last(E)))
B=findnext(b,sseq,last(B)+1)
eos=!isnothing(B)
elseif !isnothing(B)
B=findnext(b,sseq,last(B)+1)
eos=!isnothing(B)
else
eos=false
end
end
v
end
startend1 (generic function with 1 method)
julia> seq = randdnaseq(1*10^4);
julia> rc=collect(locationgenerator(seq));
julia> @btime collect(locationgenerator(seq));
691.770 ms (22726178 allocations: 1.03 GiB)
julia> sseq=string(seq);
julia> @btime startend1(sseq,b,e);
125.900 μs (3522 allocations: 97.33 KiB)
julia> rc == startend1(sseq,b,e)
true
julia> @btime startend1(sseq,b,e);
131.100 μs (3522 allocations: 97.33 KiB)
julia> @btime startend1(sseq,b,e);
122.100 μs (3522 allocations: 97.33 KiB)
julia> @btime startend1(sseq,b,e);
111.000 μs (3522 allocations: 97.33 KiB)
Tried to patch together a function using regular expressions to get the same output. It even performed well:
using IterTools
import Base.Iterators as Itr
const sseq=join(rand(('A','G','T','C'), 10^4)); # const by @rafael's suggestion
f(R) = findnext(r"ATG(?:[AGTC][AGTC][AGTC])*?(TAG|TAA|TGA)",sseq,first(R)+3)
itr = Itr.takewhile(!isnothing, Itr.drop(iterated(f, -2:-2),1))
# that's all !!
# r"ATG(?:[AGTC]{3})*?T(AG|AA|GA)" also works
# previous post function:
e = ["TAG", "TAA", "TGA"]
b = "ATG"
function startend1(sseq,b,e)
B=findfirst(b,sseq)
eos=true
v=UnitRange[]
L=length(sseq)
while eos
Er=findfirst(i->sseq[i:i+2] ∈ e, last(B)+1:3:L-2)
if !isnothing(Er)
E=Er*3+last(B)
push!(v,(:)(first(B),last(E)))
B=findnext(b,sseq,last(B)+1)
eos=!isnothing(B)
elseif !isnothing(B)
B=findnext(b,sseq,last(B)+1)
eos=!isnothing(B)
else
eos=false
end
end
v
end
And now for a bit of benchmarking:
julia> using BenchmarkTools
julia> length([R for R in itr]) == length(startend1(sseq, b, e))
true
julia> @btime [R for R in itr];
59.781 μs (325 allocations: 11.39 KiB)
julia> @btime startend1(sseq,b,e);
180.610 μs (4022 allocations: 109.98 KiB)
Hope all this iterator wrapping is liked, as it is not everyone’s cup of tea.
ATG at start looks for START codon, then (?: )*? repeats whatever is inside 0 or more times, but lazily (that’s the ? after *). This skips codons until the first (b/c lazy) of the next group. The next group is 3 string options covering the end codons. The [AGTC] repeating three times matches a codon, as each square bracket matches a single letter from inside it.
So actually, it could be even simpler: r"ATG(?:[AGTC]{3})*?T(AG|AA|GA)"
julia> f(R, sseq) = findnext(r"ATG(?:[AGTC][AGTC][AGTC])*?(TAG|TAA|TGA)",sseq,first(R)+3)
f (generic function with 2 methods)
julia> itr = Itr.takewhile(!isnothing, Itr.drop(iterated(f, -2:-2),1))
Base.Iterators.TakeWhile{Base.Iterators.Drop{IterTools.Iterated{UnitRange{Int64}, typeof(f)}}, Base.var"#97#98"{typeof(isnothing)}}(Base.var"#97#98"{typeof(isnothing)}(isnothing), Base.Iterators.Drop{IterTools.Iterated{UnitRange{Int64}, typeof(f)}}(IterTools.Iterated{UnitRange{Int64}, typeof(f)}(f, -2:-2), 1))
julia> @btime [R for R in itr];
50.600 μs (339 allocations: 11.77 KiB)
julia> function se(sseq)
eos=true
R=-2:0
v=UnitRange[]
while eos
R=f(R,sseq)
!isnothing(R) && push!(v,R)
eos = !isnothing(R)
end
v
end
se (generic function with 1 method)
julia> @btime se(sseq);
44.600 μs (173 allocations: 7.19 KiB)
julia> se(sseq)==[R for R in itr]
The main work is done by iterated which is IterTools.iterated and returns each iteration a function applied to the value of the last iteration. f is made to return the next match after the last match which is given as a parameter. This provides the series of matches.
The edges of the iterator need to be fixed a bit, and takewhile makes sure the iteration stops at the first nothing result without returning it.
The drop(.., 1) gets rid of the initial condition used to start searching for the first match.
That’s about it.
the -2:-2 got me thinking too. In fact what matters is the first -2. The range could be anything of the type -2: Int.
To remove this intruder and also drop the the drop function, a scheme of the following type could be used which perhaps (IMO) makes this beautiful solution even more elegant.
In the first function, the Generator was adding extra STOP codons (don’t really know why), so with the extra condition of hasprematurestop that was fixed, yet it meant extra performance cost.
Note that I also think that the finder is more exhaustive if it just adds 1 to the next search instead of 3, and you might have noticed that I’m using findfirst instead of findnext but the biore"" does not have a defined findnext method, yet the findfirst is doing the job.
And when it comes to collect a huge 100Mb seq it is fast and quite efficient!