In the GitHub discussion (#27084), people seem to be gravitating towards setting the PCRE2_UCP flag by default.
I would strongly recommend against that, and judging from their comments, the authors of PCRE2 seem to feel the same way. From the PCRE2 documentation:
By default, characters whose code points are greater than 127 never match \d, \s, or \w, and always match \D, \S, and \W, … These escape sequences retain their original meanings from before Unicode support was available, mainly for efficiency reasons.
Matching these sequences is noticeably slower when PCRE2_UCP is set.
Matching characters by Unicode property is not fast, because PCRE2 has to do a multistage table lookup in order to find a character’s property. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE2 by default, though you can make them do so by setting the PCRE2_UCP option or by starting the pattern with (*UCP).
Note that if you really want to take the performance hit and use the Unicode tables, you can simply add (*UCP) at the beginning of the pattern. If you only want to use the Unicode tables in certain parts of the regex, you can use the following equivalences:
\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L} or \p{N}, plus underscore
i.e. instead of \d, explicitly write \p{Nd}; instead of \s, [\pZ\h\v]; and instead of \w, [\pL\pN_].
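As a rough illustration of the equivalences above, here is a small Julia sketch. The variable names and sample strings are my own, and it assumes Julia's built-in Regex (which wraps PCRE2 in UTF mode). The explicit property classes do not depend on PCRE2_UCP being set, while the last pattern opts in for the whole pattern via the (*UCP) prefix.

```julia
# Explicit Unicode property classes; these do not rely on PCRE2_UCP.
digit = r"^\p{Nd}+$"        # Unicode-aware stand-in for \d
space = r"^[\pZ\h\v]+$"     # Unicode-aware stand-in for \s
word  = r"^[\pL\pN_]+$"     # Unicode-aware stand-in for \w

println(occursin(digit, "٣٤5"))        # Arabic-Indic digits plus an ASCII digit
println(occursin(space, " \u2003\t"))  # space, EM SPACE, tab
println(occursin(word, "ñandú_٣"))     # letters, underscore, a non-ASCII digit

# Opting in for the whole pattern: (*UCP) must come first in the pattern.
ucp_digit = r"(*UCP)^\d+$"
println(occursin(ucp_digit, "٣٤"))
```

Only the parts of a pattern written with \p{...} pay for the property lookup, so the rest of the regex can keep using the plain ASCII \d, \s, and \w.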
If that is deemed too inconvenient, we could simply preprocess the pattern and make replacements such as the following (or whatever escape sequences would be easy to remember and use):
\ñ => \p{Nd} (n for numeric, tilde to indicate Unicode)
\Ñ => \P{Nd}
\á => [\pZ\h\v] (whitespace)
\Á => [^\pZ\h\v]
\í => [\pL\pN_] (i for identifier, accent for Unicode)
\Í => [^\pL\pN_]
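A minimal sketch of what such preprocessing could look like in Julia follows. The names UNICODE_SHORTHANDS, expand_shorthands, and unicode_regex are hypothetical, and the table assumes that the lowercase shorthands map to the positive classes and the uppercase ones to the negated classes. It is a plain textual substitution, so it does not handle an escaped backslash in front of one of these letters, and the bracketed expansions would not be valid inside an existing character class.

```julia
# Hypothetical shorthand table; none of these escapes exist in PCRE2 itself.
const UNICODE_SHORTHANDS = [
    "\\ñ" => "\\p{Nd}",        # Unicode decimal digit
    "\\Ñ" => "\\P{Nd}",        # anything but a Unicode decimal digit
    "\\á" => "[\\pZ\\h\\v]",   # Unicode whitespace
    "\\Á" => "[^\\pZ\\h\\v]",  # anything but Unicode whitespace
    "\\í" => "[\\pL\\pN_]",    # Unicode identifier character
    "\\Í" => "[^\\pL\\pN_]",   # anything but an identifier character
]

# Naive textual expansion of the shorthands into PCRE2 property classes.
function expand_shorthands(pattern::AbstractString)
    for (short, full) in UNICODE_SHORTHANDS
        pattern = replace(pattern, short => full)
    end
    return pattern
end

# Build a Regex from a pattern that may contain the shorthands.
unicode_regex(pattern::AbstractString) = Regex(expand_shorthands(pattern))

println(expand_shorthands("\\í+=\\ñ+"))
println(occursin(unicode_regex("^\\ñ+\$"), "٣٤"))
```

A real implementation would need to tokenize the pattern rather than substitute text, so that escapes and character classes are respected.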
The problem with making it the default is that there is no easy way to disable it if you want the best performance and are only concerned with matching ASCII characters anyway.
I made this more convenient by adding a u option (in addition to the i, m, s, and x ones already available) to StrRegex.jl.