Regex matching when a string has non-ASCII Unicode

This is not e-with-grave in Unicode. "\xe8" is a single raw byte (\x is byte escaping, not Unicode character escaping), and the lone byte 0xE8 is how e-with-grave is written in a legacy 8-bit encoding such as Latin-1 or Windows-1252, not in Unicode. Native Julia strings are Unicode, stored as UTF-8 (though you can convert from other encodings using StringEncodings.jl).
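To see the difference concretely, here is a minimal sketch using only Base functions. The variable names are mine, and the repair step assumes the stray bytes really are Latin-1 (ISO-8859-1), where each byte value equals its Unicode code point. That assumption does not hold for Windows-1252 bytes in the range 0x80 to 0x9F, for which a real converter such as StringEncodings.jl is the safer route.

```julia
bad  = "caf\xe8"   # ends in the single byte 0xE8, which is not valid UTF-8
good = "caf\ue8"   # ends in U+00E8, stored in UTF-8 as the two bytes 0xC3 0xA8

isvalid(bad)       # false: the string contains an invalid UTF-8 sequence
isvalid(good)      # true

ncodeunits(bad)    # 4 bytes
ncodeunits(good)   # 5 bytes

# Assumption: the bytes are Latin-1, so byte value == code point.
fixed = join(Char(b) for b in codeunits(bad))
fixed == good      # true
```

Checking isvalid on input that comes from files or other programs is a cheap way to catch this kind of encoding mismatch before it reaches a regex.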

In Unicode, e-with-grave is U+00E8, which can be entered as

julia> s = "caf\ue8"
"cafè"

(or simply as s = "cafè"). Notice that it prints correctly with the accent, unlike "caf\xe8". Regex then works fine:

julia> eachmatch(r".", s) |> collect
4-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")
 RegexMatch("è")

Beware that there are subtleties with accented characters in Unicode, because they can typically be encoded in two canonically equivalent ways: as an accented character like U+00E8, or as an unaccented letter followed by a “combining character” encoding the accent. In particular, the encoding

julia> s = "cafe\u0300"
"cafè"

using the U+0300 combining accent, is an equivalent way (according to Unicode) to express "cafè", but it contains 5 “characters” (Unicode codepoints) instead of 4. Regex still works, but then the accent gets matched separately:

julia> eachmatch(r".", s) |> collect
5-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")
 RegexMatch("e")
 RegexMatch("̀")

(This is a property of Unicode, not Julia.)
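If you want the two spellings to behave the same, one option is to normalize the string before matching. The sketch below uses the Unicode standard library; the variable names are mine, and NFC is just one reasonable choice of normalization form (it prefers precomposed characters where they exist).

```julia
using Unicode

composed   = "caf\ue8"      # precomposed U+00E8
decomposed = "cafe\u0300"   # 'e' followed by the combining grave accent U+0300

composed == decomposed                  # false: different code point sequences
length(composed), length(decomposed)    # (4, 5)

# NFC composes 'e' + U+0300 into the single code point U+00E8.
nfc = Unicode.normalize(decomposed, :NFC)
nfc == composed                         # true

# After normalization, r"." sees four characters again.
matches = [m.match for m in eachmatch(r".", nfc)]
length(matches)                         # 4

# Alternatively, count user-perceived characters without normalizing.
length(Unicode.graphemes(decomposed))   # 4
```

Not every accented combination has a precomposed form, so normalization reduces the problem but does not eliminate it. Iterating over graphemes is the more general approach when you need one unit per visible character.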
