Regex matching when string has non-ascii unicode

sleak · October 25, 2021, 1:29am

I’m using regex to split a string and getting unexpected results when the input data contains non-ASCII characters (specifically, a Unicode \xe8). In Python I can do:

>>> import re
>>> s = "caf\xe8"
>>> re.findall('.', s)
['c', 'a', 'f', 'è']

But in Julia (1.7rc2) I’m not able to match the e-with-grave:

julia> s = "caf\xe8"
"caf\xe8"

julia> eachmatch(r".", s) |> collect
3-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")

julia> eachmatch(r"\X", s) |> collect
3-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")

Am I doing something wrong, or is this a bug in the PCRE module that I should report?
thanks,
Steve

stevengj · October 25, 2021, 2:57am

This is not e-with-grave in Unicode. You are using an obsolete Latin1 encoding, probably Windows 1252, not Unicode (\x is byte escaping, not Unicode character escaping). Native Julia strings are Unicode (though you can convert from other encodings using StringEncodings.jl).
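To see the difference at the byte level, here is a minimal sketch using only Base functions (the variable names are just for illustration). It suggests why the regex skips the last character: the lone byte 0xE8 is not valid UTF-8, whereas U+00E8 is encoded as the two bytes 0xC3 0xA8.

```julia
# "\xe8" inserts the raw byte 0xE8; "\ue8" inserts the code point U+00E8.
raw_bytes = "caf\xe8"
unicode   = "caf\ue8"

# A lone 0xE8 byte is not a valid UTF-8 sequence.
println(isvalid(raw_bytes))   # false
println(isvalid(unicode))     # true

# Inspect the underlying bytes of each string.
println(collect(codeunits(raw_bytes)))  # 4 bytes, ending in 0xe8
println(collect(codeunits(unicode)))    # 5 bytes, ending in 0xc3 0xa8
```

Checking `isvalid` on input data before running a regex over it is a cheap way to catch this kind of encoding mismatch early.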

In Unicode, e-with-grave is U+00E8, which can be entered as

julia> s = "caf\ue8"
"cafè"

(or simply as s = "cafè"). Notice that it prints correctly with the accent, unlike "caf\xe8". Regex then works fine:

julia> eachmatch(r".", s) |> collect
4-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")
 RegexMatch("è")

Beware that there are subtleties with accented characters in Unicode, because they can typically be encoded in two canonically equivalent ways: as an accented character like U+00E8, or as an unaccented letter followed by a “combining character” encoding the accent. In particular, the encoding

julia> s = "cafe\u0300"
"cafè"

using the U+0300 combining accent, is an equivalent way (according to Unicode) to express "cafè", but it contains 5 “characters” (Unicode codepoints) instead of 4. Regex still works, but then the accent gets matched separately:

julia> eachmatch(r".", s) |> collect
5-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")
 RegexMatch("e")
 RegexMatch("̀")

(This is a property of Unicode, not Julia.)
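If you want the two forms to compare equal, or want `.` to match the accented letter as a single character, one option is to normalize to the composed form (NFC) first. A minimal sketch using the `Unicode` standard library (variable names are illustrative):

```julia
using Unicode

decomposed  = "cafe\u0300"   # 'e' followed by the combining grave accent
precomposed = "caf\ue8"      # single code point U+00E8

# Canonically equivalent, but not equal as code point sequences.
println(decomposed == precomposed)                  # false

# NFC normalization composes 'e' + U+0300 into U+00E8.
nfc = Unicode.normalize(decomposed, :NFC)
println(nfc == precomposed)                         # true
println((length(decomposed), length(nfc)))          # (5, 4)

# After normalization, `.` matches the accented letter as one character.
println([m.match for m in eachmatch(r".", nfc)])
```

Note that not every base-plus-accent combination has a precomposed code point, so normalization reduces this problem but does not eliminate it in general.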

sleak · October 25, 2021, 5:22am

Ah, that explains it, thanks! Reading the input data through StringEncodings.jl worked, and now the regex matches successfully.
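For readers without StringEncodings.jl at hand, here is a rough dependency-free sketch of the same idea. It assumes the data really is ISO-8859-1 (Latin-1), where each byte value equals the Unicode code point of the character it encodes; this does not hold for Windows-1252 in the 0x80 to 0x9F range. The helper name `latin1_to_string` is hypothetical, not part of any library.

```julia
# Hypothetical helper: map each Latin-1 byte to the code point of the same value.
# Only correct for ISO-8859-1 input, not for Windows-1252 or other encodings.
latin1_to_string(bytes::AbstractVector{UInt8}) = join(Char.(bytes))

raw   = codeunits("caf\xe8")     # bytes as they might arrive from a Latin-1 file
fixed = latin1_to_string(raw)

println(isvalid(fixed))          # true
println(fixed == "caf\ue8")      # true

# The regex now sees four characters.
println([m.match for m in eachmatch(r".", fixed)])
```

For anything beyond plain Latin-1, a real conversion library such as StringEncodings.jl is the safer choice.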

stevengj · October 26, 2021, 1:04pm

I should mention that the combining-character case can be handled by using \X instead of . in your regex. The \X escape in a regular expression matches a Unicode “grapheme”, which corresponds more closely to a human-perceived “character” (e.g. it includes a letter followed by any number of combining modifiers).

For example:

julia> s = "cafe\u0300"
"cafè"

julia> eachmatch(r"\X", s) |> collect
4-element Vector{RegexMatch}:
 RegexMatch("c")
 RegexMatch("a")
 RegexMatch("f")
 RegexMatch("è")

which contains 4 matches as expected — but the final RegexMatch("è") match actually consists of a string "e\u0300" of two characters (two Unicode codepoints).
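To check this outside the REPL, here is a small sketch that collects the matched substrings and counts the code points in the last one. It also uses `graphemes` from the `Unicode` standard library, which iterates over the same kind of units without a regex (variable names are illustrative):

```julia
using Unicode

s = "cafe\u0300"

# Each \X match is one grapheme; keep just the matched substrings.
clusters = [m.match for m in eachmatch(r"\X", s)]

println(length(clusters))          # 4
println(length(last(clusters)))    # 2 code points: 'e' and U+0300

# The Unicode stdlib offers a grapheme iterator as an alternative to \X.
println(length(graphemes(s)))      # 4
```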
