# Find the position of a single non-matching character between two strings

To find the position of a single non-matching character between two strings:

s1 = "part1_Z_part2.txt"
s2 = "part1_Y_part2.txt"

ix = findfirst([!=(c[1],c[2]) for c in zip(s1,s2)])


Is there a simpler way to write findfirst without creating the temporary vector? Is there a better way?

1 Like

Maybe the definition of simple can be discussed, if plain old for loop are an option, then the following code works, is readable and doesnâ€™t allocate. (But it is not a one-liner )

function firstdiff(s1, s2)

if length(s1) != length(s2)
return min(length(s1), length(s2)) + 1
end

for (i,(c1,c2)) in enumerate(zip(s1,s2))
if c1 != c2
return i
end
end

return 0
end


(Oh, I just noted your profile name, so it wasnâ€™t a beginner question. For sure that solution was obvious to you anyway )

2 Likes

It is definitely better and it is simple, but not simpler.

1 Like

Ok, I have an obscure, (not simple), non-allocating oneâ€¦

minimum( (a==b) ? typemax(Int64) : i for (i,(a,b)) in enumerate(zip(s1,s2)) )


[Itâ€™s somehow surprisingly funny/difficult to find a good one-liner given how simple the question is.]

1 Like

No access to computer here, but does it work with a generator instead of an array comprehension? I suspect it requires indexing, though.

It doesnâ€™t, as you suspect. findall needs keys to get the index to return, and keys isnâ€™t defined for zip (related (stale) issue)

1 Like

Somehow notâ€¦

julia> findfirst( a!=b for (a,b) in zip(s1,s2) )
ERROR: MethodError: no method matching keys(::Base.Iterators.Zip{Tuple{String, String}})
Closest candidates are:
keys(::IndexStyle, ::AbstractArray, ::AbstractArray...) at ~/julia/1.7.3/share/julia/base/abstractarray.jl:350
keys(::Tuple) at ~/julia/1.7.3/share/julia/base/tuple.jl:72
keys(::Tuple, ::Tuple...) at ~/julia/1.7.3/share/julia/base/tuple.jl:77


With @digital_carver / @DNF hints of why zip makes troubleâ€¦ so that would work and not allocate:

findfirst( s1[i] != s2[i] for i in 1:min(length(s1),length(s2)) )

1 Like

With the caveat that this only works for ASCII strings. For eg.

julia> s1 = "cafĂ©teria"; s2 = "cafĂ©meria";

julia> findfirst( s1[i] != s2[i] for i in 1:min(length(s1),length(s2)) )
ERROR: StringIndexError: invalid index [5], valid nearby indices [4]=>'Ă©', [6]=>'t'

1 Like

Actually as I haved asked this with file names in mind, that are similar but for one letter, I guess Unicode should be allowed too.

Good catchâ€¦ Uff. Does this count? (Anyway, I should work on my thesis, so I will give up, curious what tricks people will find.)

ind = for (i,(a,b)) in enumerate(zip(s1,s2));  a == b || return i; end


Note that for String you can probably do better (performance-wise) by comparing bytes in the codeunits(s1) and codeunits(s2) arrays, then converting the resulting byte index back to a string index with thisind.

That does the trick, works for the Unicode character strings too. Basically a shortened version of your original function, without the length check. That also means that when the lengths are different with one of them being a substring of the other, for eg. â€śjuliaâ€ť and â€śjulialangâ€ť, it returns nothing - not sure if thatâ€™s okay or not for @rafael.guerra 's use case.

(Tangential, but I couldnâ€™t find this return in a for loop documented in the manual section on loops or in REPL docstrings. Does it only work in global scope, where can I find more about it?)

Nope, none of the solutions in this thread so far have been correct for Unicode; they confuse character indices (ala enumerate) with string indices.

1 Like

Oh yeah, I was gonna mention that in a now-abandoned post. By string indices you mean byte indices I presume? If itâ€™s something user-facing, graphemes may also be the thing to consider.

I mean indices that you can actually use to index into the string, i.e. an index i where s1[i] != s2[i] is valid, so you can use it for subsequent processing. Yes, technically this is a codeunit index (a byte index for String).

For example, this implementation is both faster than anything posted so far and is correct for Unicode (in that it returns a valid index or nothing), though it doesnâ€™t take Unicode normalization into account:

const UTF8String = Union{String,SubString{String}}
function firstdiff_index(s1::UTF8String, s2::UTF8String)
c1, c2 = codeunits(s1), codeunits(s2)
@inbounds for i in 1:min(length(c1),length(c2))
c1[i] != c2[i] && return thisind(s1, i)
end
return nothing
end


What would the user do with a grapheme index?

2 Likes

If Iâ€™m not mistaken, @SteffenPL got the simplest solution for ASCII strings in post#8, but only @stevengjâ€™s solution provides correct answers for Unicode strings. Thanks to all.

2 Likes
findfirst(==(only(setdiff(s1,s2))), s1)

1 Like

This is > 50\times slower than a loop, and is also somewhat different from the other solutions in that it fails if s1 and s2 differ in more than a single character, instead of returning the first mismatch.

Update: also incorrect as described below

Note that if you want something that works for arbitrary AbstractString subtypes (not just UTF-8 encodings), you could use:

function firstdiff_indices(s1::AbstractString, s2::AbstractString)
for ((i1, c1), (i2, c2)) in zip(pairs(s1), pairs(s2))
c1 != c2 && return (i1, i2)
end
return nothing
end


(Note that in this case you need to return two indices in general, since s1 and s2 might have different indexing schemes.) Itâ€™s non-allocating, but is about 5x slower than the byte-scan method for String.

1 Like