Find the position of a single non-matching character between two strings

rafael.guerra · August 30, 2022, 5:07pm

To find the position of a single non-matching character between two strings:

s1 = "part1_Z_part2.txt"
s2 = "part1_Y_part2.txt"

ix = findfirst([!=(c[1],c[2]) for c in zip(s1,s2)])

Is there a simpler way to write findfirst without creating the temporary vector? Is there a better way?

SteffenPL · August 30, 2022, 5:46pm

Maybe the definition of simple can be discussed, if plain old for loop are an option, then the following code works, is readable and doesn’t allocate. (But it is not a one-liner )

function firstdiff(s1, s2)

  if length(s1) != length(s2)
    return min(length(s1), length(s2)) + 1
  end

  for (i,(c1,c2)) in enumerate(zip(s1,s2))
    if c1 != c2
      return i
    end
  end

  return 0 
end

(Oh, I just noted your profile name, so it wasn’t a beginner question. For sure that solution was obvious to you anyway )

rafael.guerra · August 30, 2022, 5:54pm

It is definitely better and it is simple, but not simpler.

SteffenPL · August 30, 2022, 6:07pm

Ok, I have an obscure, (not simple), non-allocating one…

minimum( (a==b) ? typemax(Int64) : i for (i,(a,b)) in enumerate(zip(s1,s2)) )

[It’s somehow surprisingly funny/difficult to find a good one-liner given how simple the question is.]

DNF · August 30, 2022, 6:09pm

No access to computer here, but does it work with a generator instead of an array comprehension? I suspect it requires indexing, though.

digital_carver · August 30, 2022, 6:12pm

It doesn’t, as you suspect. findall needs keys to get the index to return, and keys isn’t defined for zip (related (stale) issue)

SteffenPL · August 30, 2022, 6:12pm

Somehow not…

julia> findfirst( a!=b for (a,b) in zip(s1,s2) )
ERROR: MethodError: no method matching keys(::Base.Iterators.Zip{Tuple{String, String}})
Closest candidates are:
  keys(::IndexStyle, ::AbstractArray, ::AbstractArray...) at ~/julia/1.7.3/share/julia/base/abstractarray.jl:350
  keys(::Tuple) at ~/julia/1.7.3/share/julia/base/tuple.jl:72
  keys(::Tuple, ::Tuple...) at ~/julia/1.7.3/share/julia/base/tuple.jl:77

SteffenPL · August 30, 2022, 6:15pm

With @digital_carver / @DNF hints of why zip makes trouble… so that would work and not allocate:

findfirst( s1[i] != s2[i] for i in 1:min(length(s1),length(s2)) )

digital_carver · August 30, 2022, 6:19pm

With the caveat that this only works for ASCII strings. For eg.

julia> s1 = "caféteria"; s2 = "cafémeria";

julia> findfirst( s1[i] != s2[i] for i in 1:min(length(s1),length(s2)) )
ERROR: StringIndexError: invalid index [5], valid nearby indices [4]=>'é', [6]=>'t'

rafael.guerra · August 30, 2022, 6:27pm

Actually as I haved asked this with file names in mind, that are similar but for one letter, I guess Unicode should be allowed too.

SteffenPL · August 30, 2022, 6:39pm

Good catch… Uff. Does this count? (Anyway, I should work on my thesis, so I will give up, curious what tricks people will find.)

ind = for (i,(a,b)) in enumerate(zip(s1,s2));  a == b || return i; end

stevengj · August 30, 2022, 6:47pm

Note that for String you can probably do better (performance-wise) by comparing bytes in the codeunits(s1) and codeunits(s2) arrays, then converting the resulting byte index back to a string index with thisind.

digital_carver · August 30, 2022, 6:52pm

That does the trick, works for the Unicode character strings too. Basically a shortened version of your original function, without the length check. That also means that when the lengths are different with one of them being a substring of the other, for eg. “julia” and “julialang”, it returns nothing - not sure if that’s okay or not for @rafael.guerra 's use case.

(Tangential, but I couldn’t find this return in a for loop documented in the manual section on loops or in REPL docstrings. Does it only work in global scope, where can I find more about it?)

stevengj · August 30, 2022, 6:52pm

Nope, none of the solutions in this thread so far have been correct for Unicode; they confuse character indices (ala enumerate) with string indices.

digital_carver · August 30, 2022, 6:56pm

Oh yeah, I was gonna mention that in a now-abandoned post. By string indices you mean byte indices I presume? If it’s something user-facing, graphemes may also be the thing to consider.

stevengj · August 30, 2022, 7:00pm

I mean indices that you can actually use to index into the string, i.e. an index i where s1[i] != s2[i] is valid, so you can use it for subsequent processing. Yes, technically this is a codeunit index (a byte index for String).

For example, this implementation is both faster than anything posted so far and is correct for Unicode (in that it returns a valid index or nothing), though it doesn’t take Unicode normalization into account:

const UTF8String = Union{String,SubString{String}}
function firstdiff_index(s1::UTF8String, s2::UTF8String)
    c1, c2 = codeunits(s1), codeunits(s2)
    @inbounds for i in 1:min(length(c1),length(c2))
        c1[i] != c2[i] && return thisind(s1, i)
    end
    return nothing
end

What would the user do with a grapheme index?

rafael.guerra · August 30, 2022, 7:06pm

If I’m not mistaken, @SteffenPL got the simplest solution for ASCII strings in post#8, but only @stevengj’s solution provides correct answers for Unicode strings. Thanks to all.

rocco_sprmnt21 · August 30, 2022, 7:10pm

findfirst(==(only(setdiff(s1,s2))), s1)

stevengj · August 30, 2022, 7:13pm

This is > 50\times slower than a loop, and is also somewhat different from the other solutions in that it fails if s1 and s2 differ in more than a single character, instead of returning the first mismatch.

Update: also incorrect as described below

stevengj · August 30, 2022, 7:17pm

Note that if you want something that works for arbitrary AbstractString subtypes (not just UTF-8 encodings), you could use:

function firstdiff_indices(s1::AbstractString, s2::AbstractString)
    for ((i1, c1), (i2, c2)) in zip(pairs(s1), pairs(s2))
        c1 != c2 && return (i1, i2)
    end
    return nothing
end

(Note that in this case you need to return two indices in general, since s1 and s2 might have different indexing schemes.) It’s non-allocating, but is about 5x slower than the byte-scan method for String.

Topic		Replies	Views
"Simpler" findfirst/firstall methods? Internals & Design	16	835	March 1, 2021
Dictionary from a String Sequence Iteration General Usage strings , dictionary	7	527	January 11, 2022
The correct use of findfirst General Usage strings	4	5887	March 26, 2019
Compare patterns in byte array New to Julia	4	511	February 19, 2021
Find position on list (fast) General Usage	13	779	November 22, 2021

Find the position of a single non-matching character between two strings

Related topics