Dictionary from a String Sequence Iteration

I’m looking for a more elegant, maybe faster way to create a dictionary of position => difference between two string sequences. This is what I’m currently using:

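function mutationpositions(wildtype, variant)
    # Assumes both sequences have the same length and are ASCII, so wildtype[i] is the i-th character.
    dict = Dict{Int,Char}()
    for i ∈ 1:length(wildtype)
        if wildtype[i] != variant[i]
            dict[i] = variant[i]  # position => variant character
        end
    end
    return dict
end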

I thought maybe I could use a list comprehension with zip kind of like this:

count(((a,b),) -> a != b, zip(sequence₁,sequence₂))

but can’t seem to figure out how to capture the position.

Thanks for any help.

JB

I don't know if this is faster, and it seems that `wildtype` and `variant` have to be of the same length. Perhaps you'll like this, although it doesn't create a Dict:
bv = split(wildtype, "") .== split(variant, "")
indices = findall(.!bv)
variants = split(variant, "")[.!bv]
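If a Dict is still wanted, the same broadcast idea can be taken one step further. This is only a minimal sketch, assuming equal-length sequences; the helper name `mutationdict` and the sample strings are made up for illustration and are not from the posts above:

```julia
# Hypothetical helper: broadcast comparison turned into a Dict of position => variant character.
function mutationdict(wildtype::AbstractString, variant::AbstractString)
    w = collect(wildtype)        # Vector{Char}, so positions count characters, not bytes
    v = collect(variant)
    length(w) == length(v) || throw(ArgumentError("sequences must have the same length"))
    idx = findall(w .!= v)       # positions where the two sequences differ
    return Dict(idx .=> v[idx])  # pair each position with the variant character
end

mutationdict("ACGTAC", "ACCTAA")   # Dict(3 => 'C', 6 => 'A')
```

Using `collect` instead of `split(s, "")` gives `Char` elements directly, which matches the `Dict{Int,Char}` in the original question.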
;-) Stefan is right:

It’s generally not recommended to do genome stuff with strings. It’s an inefficient representation since you only need two bits per DNA nucleotide and strings have to handle the complexity of potentially holding Unicode data, which cannot happen with DNA. Consider using the BioJulia packages designed for genetic data:

https://biojulia.github.io/Bio.jl/man/seq/
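To make the "two bits per nucleotide" point concrete, here is a rough sketch of packing bases into an integer. It is only an illustration of the idea and not a description of how the BioJulia packages store sequences; the function names `basecode`, `pack2bit`, and `unpack2bit` are invented for this example, and it assumes only the four unambiguous bases A, C, G, T:

```julia
# Illustrative 2-bit encoding of a DNA base (hypothetical helper, A/C/G/T only).
function basecode(base::Char)
    base == 'A' && return UInt64(0)
    base == 'C' && return UInt64(1)
    base == 'G' && return UInt64(2)
    base == 'T' && return UInt64(3)
    throw(ArgumentError("not an unambiguous DNA base: $base"))
end

# Pack up to 32 bases into a single UInt64, two bits per base.
function pack2bit(seq::AbstractString)
    length(seq) <= 32 || throw(ArgumentError("this sketch packs at most 32 bases"))
    packed = zero(UInt64)
    for (i, base) in enumerate(seq)
        packed |= basecode(base) << (2 * (i - 1))
    end
    return packed
end

# Read back the base stored at 1-based position i.
unpack2bit(packed::UInt64, i::Integer) = "ACGT"[Int((packed >> (2 * (i - 1))) & 0x03) + 1]

packed = pack2bit("ACGTAC")
unpack2bit(packed, 3)   # 'G'
sizeof(packed)          # 8 bytes for up to 32 bases; an ASCII String needs one byte per base
```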

[for me, that link should be Biological Sequences - Bio.jl – or rather Biological Symbols · BioSequences.jl]

Without prejudice to all the advice to use a suitable package, a toy function to do what you ask could be the following:

findeachpos(s1,s2)=(first(p) for p in enumerate(zip(s1,s2)) if last(p)[1]!=last(p)[2])
collect(findeachpos(s1,s2))

Or with unpacking/destructuring to variable names:

mutationpositions = Dict(
    i => v for (i, (w, v)) in
    enumerate(zip(wildtype, variant))
    if w != v
)
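For completeness, here is one way the comprehension could be wrapped in a function and called. This is a sketch rather than a definitive version: the length check is an addition, there because `zip` silently stops at the end of the shorter sequence, and the sample strings are made up:

```julia
# Hypothetical wrapper around the comprehension; the explicit length check is an addition.
function mutationpositions(wildtype, variant)
    length(wildtype) == length(variant) ||
        throw(ArgumentError("sequences must have the same length"))
    # enumerate counts characters, so i is a 1-based character position.
    return Dict{Int,Char}(
        i => v for (i, (w, v)) in enumerate(zip(wildtype, variant)) if w != v
    )
end

mutationpositions("MKTAYIAK", "MKSAYIGK")   # Dict(3 => 'S', 7 => 'G')
```

Because `enumerate` counts characters rather than byte offsets, the positions stay meaningful even if a sequence contains non-ASCII characters, where indexing a `String` with `1:length(s)` can throw.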

Weirdly, I copied the URL from the browser, so I'm not sure why it would later not work in the same browser.

Thanks for the info. I’ve been planning on checking out the BioJulia packages, but I have a code base already written that I’ll have to refactor. This question was also as much about learning code alternatives. I did not even think to wrap enumerate around zip - that’s the idea I was looking for.

Thanks again - this discourse community is the best I’ve worked with. Very helpful.
