Maybe the definition of "simple" is open to discussion. If plain old for loops are an option, then the following code works, is readable, and doesn't allocate. (But it is not a one-liner.)
function firstdiff(s1, s2)
    if length(s1) != length(s2)
        return min(length(s1), length(s2)) + 1
    end
    for (i, (c1, c2)) in enumerate(zip(s1, s2))
        if c1 != c2
            return i
        end
    end
    return 0
end
(Oh, I just noticed your profile name, so it wasn't a beginner question. I'm sure that solution was obvious to you anyway.)
julia> findfirst( a!=b for (a,b) in zip(s1,s2) )
ERROR: MethodError: no method matching keys(::Base.Iterators.Zip{Tuple{String, String}})
Closest candidates are:
keys(::IndexStyle, ::AbstractArray, ::AbstractArray...) at ~/julia/1.7.3/share/julia/base/abstractarray.jl:350
keys(::Tuple) at ~/julia/1.7.3/share/julia/base/tuple.jl:72
keys(::Tuple, ::Tuple...) at ~/julia/1.7.3/share/julia/base/tuple.jl:77
Note that for String you can probably do better (performance-wise) by comparing bytes in the codeunits(s1) and codeunits(s2) arrays, then converting the resulting byte index back to a string index with thisind.
That does the trick, and it works for Unicode character strings too. It is basically a shortened version of your original function, without the length check. That also means that when the lengths differ and one string is a prefix of the other, e.g. "julia" and "julialang", it returns nothing. I'm not sure whether that's okay for @rafael.guerra's use case.
(Tangential, but I couldn't find this use of return in a for loop documented in the manual section on loops or in the REPL docstrings. Does it only work in global scope? Where can I find more about it?)
Oh yeah, I was gonna mention that in a now-abandoned post. By string indices you mean byte indices I presume? If it’s something user-facing, graphemes may also be the thing to consider.
I mean indices that you can actually use to index into the string, i.e. an index i where s1[i] != s2[i] is valid, so you can use it for subsequent processing. Yes, technically this is a codeunit index (a byte index for String).
For example, this implementation is both faster than anything posted so far and is correct for Unicode (in that it returns a valid index or nothing), though it doesn’t take Unicode normalization into account:
const UTF8String = Union{String,SubString{String}}
function firstdiff_index(s1::UTF8String, s2::UTF8String)
    c1, c2 = codeunits(s1), codeunits(s2)
    @inbounds for i in 1:min(length(c1), length(c2))
        c1[i] != c2[i] && return thisind(s1, i)
    end
    return nothing
end
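To make the difference between a character count and a usable string index concrete, here is a small sketch using only Base functions. The example string is an arbitrary choice for illustration; each of its letters takes two bytes in UTF-8.

```julia
s = "αβγ"                          # three characters, two bytes each in UTF-8

n_chars = length(s)                # 3: number of characters
n_bytes = ncodeunits(s)            # 6: number of code units (bytes)
valid   = collect(eachindex(s))    # [1, 3, 5]: indices that can be used as s[i]

# Byte 6 is the second byte of 'γ'; thisind maps it back to the index
# where that character starts, which is a valid string index.
idx = thisind(s, 6)                # 5
ch  = s[idx]                       # 'γ'

# Index 4 falls in the middle of 'β', so it is not a valid string index.
mid_ok = isvalid(s, 4)             # false
```

So a byte scan that finds the first mismatch at byte 6 reports index 5, whereas counting characters with enumerate would report 3, which cannot be used to index into this string.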
If I’m not mistaken, @SteffenPL got the simplest solution for ASCII strings in post#8, but only @stevengj’s solution provides correct answers for Unicode strings. Thanks to all.
This is more than 50× slower than a loop, and it also differs from the other solutions in that it fails if s1 and s2 differ in more than a single character, instead of returning the first mismatch.
Note that if you want something that works for arbitrary AbstractString subtypes (not just UTF-8 encodings), you could use:
function firstdiff_indices(s1::AbstractString, s2::AbstractString)
    for ((i1, c1), (i2, c2)) in zip(pairs(s1), pairs(s2))
        c1 != c2 && return (i1, i2)
    end
    return nothing
end
(Note that in this case you need to return two indices in general, since s1 and s2 might have different indexing schemes.) It’s non-allocating, but is about 5x slower than the byte-scan method for String.
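Both functions above return nothing when one string is a prefix of the other (the "julia" / "julialang" case raised earlier in the thread). If that case should count as a difference, one possible variant is sketched below. The name firstdiff_index_or_end and the convention of returning the index just past the common prefix are assumptions made for illustration, not something proposed in the thread, and the sketch is restricted to String and assumes valid UTF-8 for simplicity.

```julia
# Hypothetical variant (illustrative name and convention): like the byte scan
# above, but treats "one string is a proper prefix of the other" as a difference.
function firstdiff_index_or_end(s1::String, s2::String)
    c1, c2 = codeunits(s1), codeunits(s2)
    n = min(length(c1), length(c2))
    for i in 1:n
        # Map the first differing byte back to the start of its character.
        c1[i] != c2[i] && return thisind(s1, i)
    end
    # All shared bytes match: the strings are either equal or one is a prefix.
    # For the prefix case, n + 1 is where the longer string continues.
    return length(c1) == length(c2) ? nothing : n + 1
end

firstdiff_index_or_end("julia", "julialang")   # 6
firstdiff_index_or_end("αβγ", "αβδ")           # 5
firstdiff_index_or_end("abc", "abc")           # nothing
```

Note that in the prefix case the returned index is valid only in the longer string; in the shorter one it is one past the last code unit.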