I’m using regex to split a string and getting unexpected results when the input data contains non-ASCII characters (specifically, a Unicode \xe8). In Python I can do:
>>> import re
>>> s = "caf\xe8"
>>> re.findall('.', s)
['c', 'a', 'f', 'è']
But in Julia (1.7rc2) I’m not able to match the e-with-grave:
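(The original attempt isn’t shown; presumably it was something like the following, which fails because \x escapes a raw byte in Julia:)

```julia
# \x inserts a raw byte, not a Unicode codepoint, so this string
# contains the single byte 0xE8, which is not valid UTF-8:
s = "caf\xe8"
isvalid(s)   # false: the string is not valid UTF-8, so regex matching on it fails
```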
This is not e-with-grave in Unicode: \x is a byte escape, not a Unicode codepoint escape, so "caf\xe8" contains a raw 0xE8 byte. That byte is e-with-grave in the obsolete Latin-1 encoding (or Windows-1252), but it is not valid UTF-8. Native Julia strings are Unicode (UTF-8), though you can convert from other encodings using StringEncodings.jl.
In Unicode, e-with-grave is U+00E8, which can be entered as
julia> s = "caf\ue8"
"cafè"
(or simply as s = "cafè"). Notice that it prints correctly with the accent, unlike "caf\xe8". Regex then works fine:
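For instance, the analogue of the Python example above (a minimal sketch):

```julia
s = "caf\ue8"
m = [x.match for x in eachmatch(r".", s)]
# m == ["c", "a", "f", "è"]: 4 matches, one per codepoint,
# because Julia's PCRE regexes operate on Unicode codepoints by default
```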
Beware that there are subtleties with accented characters in Unicode, because they can typically be encoded in two canonically equivalent ways: as an accented character like U+00E8, or as an unaccented letter followed by a “combining character” encoding the accent. In particular, the encoding
julia> s = "cafe\u0300"
"cafè"
using the U+0300 combining accent, is an equivalent way (according to Unicode) to express "cafè", but it contains 5 “characters” (Unicode codepoints) instead of 4. Regex still works, but then the accent gets matched separately:
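Concretely, with the combining-accent encoding:

```julia
s = "cafe\u0300"
m = [x.match for x in eachmatch(r".", s)]
# 5 matches: the combining accent U+0300 comes out as its own match
```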
I should mention that this can be handled by using \X instead of . in your regex: the \X escape in a regular expression matches a Unicode “grapheme”, which corresponds more closely to a human-perceived “character” (e.g. it includes a letter followed by any number of combining modifiers).
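For example, on the combining-character string from above:

```julia
s = "cafe\u0300"
m = [x.match for x in eachmatch(r"\X", s)]
# \X matches extended grapheme clusters, so "e" and the combining
# accent U+0300 are grouped into a single match
```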
which contains 4 matches as expected, but the final RegexMatch("è") actually consists of the string "e\u0300", i.e. two Unicode codepoints.