Find the position of a single non-matching character between two strings

To find the position of a single non-matching character between two strings:

s1 = "part1_Z_part2.txt"
s2 = "part1_Y_part2.txt"

ix = findfirst([!=(c[1],c[2]) for c in zip(s1,s2)])

Is there a simpler way to write findfirst without creating the temporary vector? Is there a better way?

1 Like

Maybe the definition of simple can be discussed, if plain old for loop are an option, then the following code works, is readable and doesn’t allocate. (But it is not a one-liner :wink: )

function firstdiff(s1, s2)

  if length(s1) != length(s2)
    return min(length(s1), length(s2)) + 1
  end

  for (i,(c1,c2)) in enumerate(zip(s1,s2))
    if c1 != c2
      return i
    end
  end

  return 0 
end

(Oh, I just noted your profile name, so it wasn’t a beginner question. For sure that solution was obvious to you anyway :see_no_evil: )

2 Likes

It is definitely better and it is simple, but not simpler.

1 Like

Ok, I have an obscure, (not simple), non-allocating one…

minimum( (a==b) ? typemax(Int64) : i for (i,(a,b)) in enumerate(zip(s1,s2)) )

:smiley: [It’s somehow surprisingly funny/difficult to find a good one-liner given how simple the question is.]

1 Like

No access to computer here, but does it work with a generator instead of an array comprehension? I suspect it requires indexing, though.

It doesn’t, as you suspect. findall needs keys to get the index to return, and keys isn’t defined for zip (related (stale) issue)

1 Like

Somehow not…

julia> findfirst( a!=b for (a,b) in zip(s1,s2) )
ERROR: MethodError: no method matching keys(::Base.Iterators.Zip{Tuple{String, String}})
Closest candidates are:
  keys(::IndexStyle, ::AbstractArray, ::AbstractArray...) at ~/julia/1.7.3/share/julia/base/abstractarray.jl:350
  keys(::Tuple) at ~/julia/1.7.3/share/julia/base/tuple.jl:72
  keys(::Tuple, ::Tuple...) at ~/julia/1.7.3/share/julia/base/tuple.jl:77

With @digital_carver / @DNF hints of why zip makes trouble… so that would work and not allocate:

findfirst( s1[i] != s2[i] for i in 1:min(length(s1),length(s2)) )
1 Like

With the caveat that this only works for ASCII strings. For eg.

julia> s1 = "caféteria"; s2 = "cafémeria";

julia> findfirst( s1[i] != s2[i] for i in 1:min(length(s1),length(s2)) )
ERROR: StringIndexError: invalid index [5], valid nearby indices [4]=>'Ă©', [6]=>'t'
1 Like

Actually as I haved asked this with file names in mind, that are similar but for one letter, I guess Unicode should be allowed too.

Good catch… Uff. Does this count? (Anyway, I should work on my thesis, so I will give up, curious what tricks people will find.)

ind = for (i,(a,b)) in enumerate(zip(s1,s2));  a == b || return i; end

Note that for String you can probably do better (performance-wise) by comparing bytes in the codeunits(s1) and codeunits(s2) arrays, then converting the resulting byte index back to a string index with thisind.

That does the trick, works for the Unicode character strings too. Basically a shortened version of your original function, without the length check. That also means that when the lengths are different with one of them being a substring of the other, for eg. “julia” and “julialang”, it returns nothing - not sure if that’s okay or not for @rafael.guerra 's use case.

(Tangential, but I couldn’t find this return in a for loop documented in the manual section on loops or in REPL docstrings. Does it only work in global scope, where can I find more about it?)

Nope, none of the solutions in this thread so far have been correct for Unicode; they confuse character indices (ala enumerate) with string indices.

1 Like

Oh yeah, I was gonna mention that in a now-abandoned post. By string indices you mean byte indices I presume? If it’s something user-facing, graphemes may also be the thing to consider.

I mean indices that you can actually use to index into the string, i.e. an index i where s1[i] != s2[i] is valid, so you can use it for subsequent processing. Yes, technically this is a codeunit index (a byte index for String).

For example, this implementation is both faster than anything posted so far and is correct for Unicode (in that it returns a valid index or nothing), though it doesn’t take Unicode normalization into account:

const UTF8String = Union{String,SubString{String}}
function firstdiff_index(s1::UTF8String, s2::UTF8String)
    c1, c2 = codeunits(s1), codeunits(s2)
    @inbounds for i in 1:min(length(c1),length(c2))
        c1[i] != c2[i] && return thisind(s1, i)
    end
    return nothing
end

What would the user do with a grapheme index?

2 Likes

If I’m not mistaken, @SteffenPL got the simplest solution for ASCII strings in post#8, but only @stevengj’s solution provides correct answers for Unicode strings. Thanks to all.

2 Likes
findfirst(==(only(setdiff(s1,s2))), s1)
1 Like

This is > 50\times slower than a loop, and is also somewhat different from the other solutions in that it fails if s1 and s2 differ in more than a single character, instead of returning the first mismatch.

Update: also incorrect as described below

Note that if you want something that works for arbitrary AbstractString subtypes (not just UTF-8 encodings), you could use:

function firstdiff_indices(s1::AbstractString, s2::AbstractString)
    for ((i1, c1), (i2, c2)) in zip(pairs(s1), pairs(s2))
        c1 != c2 && return (i1, i2)
    end
    return nothing
end

(Note that in this case you need to return two indices in general, since s1 and s2 might have different indexing schemes.) It’s non-allocating, but is about 5x slower than the byte-scan method for String.

1 Like