Regex matching when a string has non-ASCII Unicode characters

I’m using a regex to split a string and getting unexpected results when the input data contains non-ASCII characters (specifically, a \xe8). In Python I can do:

>>> import re
>>> s = "caf\xe8"
>>> re.findall('.', s)
['c', 'a', 'f', 'è']

But in Julia (1.7rc2) I’m not able to match the e-with-grave:

julia> s = "caf\xe8"
"caf\xe8"

julia> eachmatch(r".", s) |> collect
3-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")

julia> eachmatch(r"\X", s) |> collect
3-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")

Am I doing something wrong, or is this a bug in the PCRE module that I should report?
thanks,
Steve

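This is not e-with-grave in Unicode. In Julia, \x is byte escaping, not Unicode character escaping, so "caf\xe8" ends in the single raw byte 0xE8. That byte means è in the legacy Latin-1 encoding (and in Windows-1252), but on its own it is not valid UTF-8, so the regex has no character to match there. (In a Python 3 str literal, \xe8 denotes the code point U+00E8 instead, which is why your Python example works.) Native Julia strings are Unicode, encoded as UTF-8, though you can convert from other encodings using StringEncodings.jl.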
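
If you want to check this yourself, here is a minimal sketch using only Base functions. It assumes the data really is Latin-1, where each byte value equals the Unicode code point of its character; for Windows-1252 or other encodings that shortcut is not generally valid and a package like StringEncodings.jl is the safer route. The variable names are just for illustration.

```julia
# "\xe8" inserts the single raw byte 0xE8 into the string.
s = "caf\xe8"

# A lone 0xE8 byte is not a complete UTF-8 sequence.
println(isvalid(s))

# Assumption: the bytes are Latin-1, so each byte maps directly to
# the code point with the same numeric value.
fixed = String(map(Char, codeunits(s)))

println(fixed)
println(isvalid(fixed))
```

After the conversion, fixed compares equal to "caf\ue8".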

In Unicode, e-with-grave is U+00E8, which can be entered as

julia> s = "caf\ue8"
"cafè"

(or simply as s = "cafè"). Notice that it prints correctly with the accent, unlike "caf\xe8". Regex then works fine:

julia> eachmatch(r".", s) |> collect
4-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")
 RegexMatch("è")
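
To see the difference between the two escapes at the byte level, something like the following sketch may help (the variable names are mine):

```julia
escaped_codepoint = "caf\ue8"   # \u inserts the code point U+00E8
escaped_byte      = "caf\xe8"   # \x inserts the raw byte 0xE8

# U+00E8 is stored as the two-byte UTF-8 sequence 0xC3 0xA8,
# whereas the \x version holds the bare byte 0xE8.
println(codeunits(escaped_codepoint))
println(codeunits(escaped_byte))

# length counts characters, ncodeunits counts bytes.
println(length(escaped_codepoint))
println(ncodeunits(escaped_codepoint))
```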

Beware that there are subtleties with accented characters in Unicode, because they can typically be encoded in two canonically equivalent ways: as an accented character like U+00E8, or as an unaccented letter followed by a “combining character” encoding the accent. In particular, the encoding

julia> s = "cafe\u0300"
"cafè"

using the U+0300 combining accent, is an equivalent way (according to Unicode) to express "cafè", but it contains 5 “characters” (Unicode codepoints) instead of 4. Regex still works, but then the accent gets matched separately:

julia> eachmatch(r".", s) |> collect
5-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")
 RegexMatch("e")
 RegexMatch("̀")

(This is a property of Unicode, not Julia.)
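
If you need the two forms to compare equal, or want . to see a single code point for the accented letter, one option is to normalize the string first. This is only a sketch, assuming NFC (composed) normalization is appropriate for your data; it uses Unicode.normalize from the Unicode standard library.

```julia
using Unicode

decomposed = "cafe\u0300"   # 'e' followed by the combining grave accent
composed   = "caf\ue8"      # precomposed U+00E8

# Canonically equivalent, but not equal as Julia strings.
println(decomposed == composed)

# NFC normalization combines 'e' + U+0300 into U+00E8.
nfc = Unicode.normalize(decomposed, :NFC)
println(nfc == composed)

# After normalization, "." matches the accented letter as one character.
dot_matches = [m.match for m in eachmatch(r".", nfc)]
println(length(dot_matches))
```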


Ah, that explains it, thanks! Reading the input data through StringEncodings.jl worked, and the regex now matches successfully.

I should mention that the combining-character case can be handled by using \X instead of . in your regex: the \X escape in a regular expression matches a Unicode “grapheme”, which corresponds more closely to a human-perceived “character” (e.g. a letter followed by any number of combining modifiers).

For example:

julia> s = "cafe\u0300"
"cafè"

julia> eachmatch(r"\X", s) |> collect
4-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")
 RegexMatch("è")

which contains 4 matches as expected — but the final RegexMatch("è") match actually consists of a string "e\u0300" of two characters (two Unicode codepoints).
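
You can inspect that last match directly. Here is a small sketch; the comparison with Unicode.graphemes from the standard library is my addition, shown as a regex-free way to get the same segmentation for this string.

```julia
using Unicode

s = "cafe\u0300"

# Collect the matched substrings rather than the RegexMatch objects.
matches = [m.match for m in eachmatch(r"\X", s)]
last_match = matches[end]

# The final grapheme is made of two code points: 'e' and U+0300.
println(length(last_match))
println(collect(last_match))

# Unicode.graphemes iterates over graphemes without using a regex.
grapheme_list = collect(Unicode.graphemes(s))
println(length(grapheme_list))
```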
