Dictionary from a String Sequence Iteration

Jdbeck66 · January 10, 2022, 8:16pm

I’m looking for a more elegant, maybe faster way to create a dictionary of position => difference between two string sequences. This is what I’m currently using.

function mutationpositions(wildtype, variant)
    dict = Dict{Int,Char}()
    for i ∈ 1:length(wildtype)
        if wildtype[i] != variant[i]
            push!(dict, i => variant[i])
        end
    end
    return dict
end

I thought maybe I could use a list comprehension with zip kind of like this:

count(((a,b),) -> a != b, zip(sequence₁,sequence₂))

but can’t seem to figure out how to capture the position.

Thanks for any help.

JB

oheil · January 10, 2022, 8:53pm

I don't know if this is faster, and it seems that `wildtype` and `variant` have to be of the same length. So perhaps you like this, but it doesn't create a Dict:
bv = split(wildtype, "") .== split(variant, "")
indices = findall(x -> x == 0, bv)
variants = split(variant, "")[.!bv]
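
If a Dict is wanted after all, the same broadcasting idea can be carried one step further. This is only a sketch: the two example strings are made up, and `collect` is swapped in for `split` so that the elements are `Char`s rather than substrings.

```julia
# Hypothetical example inputs; any two equal-length strings would do.
wildtype = "ACGTACGT"
variant  = "ACCTACGA"

# collect turns each string into a Vector{Char}, so .!= compares position by position.
w = collect(wildtype)
v = collect(variant)

positions = findall(w .!= v)                      # indices where the sequences differ
mutations = Dict(zip(positions, v[positions]))    # position => variant character
```

Like the version above, this assumes both sequences have the same length; broadcasting throws a `DimensionMismatch` otherwise.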
;-) Stefan is right:

StefanKarpinski · January 10, 2022, 9:00pm

It’s generally not recommended to do genome stuff with strings. It’s an inefficient representation since you only need two bits per DNA nucleotide and strings have to handle the complexity of potentially holding Unicode data, which cannot happen with DNA. Consider using the BioJulia packages designed for genetic data:

https://biojulia.github.io/Bio.jl/man/seq/
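
To make the "two bits per nucleotide" point concrete, here is a rough sketch in plain Julia. The names `CODE` and `pack2bit` and the particular bit assignment are invented for illustration; this is not how the BioJulia packages are implemented, just the general idea.

```julia
# Hypothetical 2-bit encoding: A=00, C=01, G=10, T=11 (an arbitrary illustrative choice).
const CODE = Dict('A' => 0x00, 'C' => 0x01, 'G' => 0x02, 'T' => 0x03)

# Pack up to 32 nucleotides into one UInt64, first nucleotide in the lowest two bits.
function pack2bit(seq::AbstractString)
    length(seq) <= 32 ||
        throw(ArgumentError("at most 32 nucleotides fit in a UInt64"))
    packed = zero(UInt64)
    for (i, c) in enumerate(seq)
        packed |= UInt64(CODE[c]) << (2 * (i - 1))
    end
    return packed
end

sizeof("ACGTACGT")      # the string occupies 8 bytes, one per nucleotide
pack2bit("ACGTACGT")    # the same 8 nucleotides use 16 bits of a single UInt64
```

With a packed representation like this, differences between two sequences could be found with a bitwise `xor` instead of a character-by-character comparison.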

tfiers · January 10, 2022, 9:17pm

[for me, that link should be Biological Sequences - Bio.jl – or rather Biological Symbols · BioSequences.jl]

rocco_sprmnt21 · January 10, 2022, 9:32pm

Without prejudice to all the advice to use a suitable package, a toy function that does what you ask could be the following:

findeachpos(s1,s2)=(first(p) for p in enumerate(zip(s1,s2)) if last(p)[1]!=last(p)[2])
collect(findeachpos(s1,s2))

tfiers · January 10, 2022, 10:49pm

Or with unpacking/destructuring to variable names:

mutationpositions = Dict(
    i => v for (i, (w, v)) in
    enumerate(zip(wildtype, variant))
    if w != v
)
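
For completeness, the same comprehension can be wrapped in a function under the original name. This is a minimal sketch; the length check and the explicit `Dict{Int,Char}` type are my additions, not something from the original code.

```julia
function mutationpositions(wildtype, variant)
    # zip stops silently at the shorter input, so check the lengths explicitly.
    length(wildtype) == length(variant) ||
        throw(ArgumentError("sequences must have the same length"))
    # Spelling out the type keeps the result a Dict{Int,Char}
    # whether or not any differences are found.
    return Dict{Int,Char}(
        i => v for (i, (w, v)) in enumerate(zip(wildtype, variant)) if w != v
    )
end

mutationpositions("ACGTACGT", "ACCTACGA")   # differences at positions 3 and 8
```

Note that `enumerate` counts characters, so `i` is the character position. For plain ASCII sequences such as DNA this is the same as the string index used in the original loop.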

StefanKarpinski · January 11, 2022, 1:49am

Weirdly I copied a URL from the browser so not sure why it would later not work in the same browser.

Jdbeck66 · January 11, 2022, 3:56pm

Thanks for the info. I’ve been planning on checking out the BioJulia packages. I have a code base already written that I’ll have to refactor. And it was as much about learning code alternatives. I did not even think to wrap enumerate around zip - that’s the idea I was looking for.

Thanks again - this discourse community is the best I’ve worked with. Very helpful.
