Regex, PCRE2, and the `PCRE2_UCP` / `(*UCP)` flag

strings

#1

In the GitHub discussion (#27084), people seem to be gravitating toward setting the PCRE2_UCP flag by default.

I would strongly recommend against that. Judging from the PCRE2 documentation, the authors of PCRE2 seem to feel the same way:

By default, characters whose code points are greater than 127 never match \d, \s, or \w, and always match \D, \S, and \W, … These escape sequences retain their original meanings from before Unicode support was available, mainly for efficiency reasons.

Matching these sequences is noticeably slower when PCRE2_UCP is set.

Matching characters by Unicode property is not fast, because PCRE2 has to do a multistage table lookup in order to find a character’s property. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE2 by default, though you can make them do so by setting the PCRE2_UCP option or by starting the pattern with (*UCP).
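To make the difference concrete, here is a small sketch using Python's built-in re module rather than PCRE2 (an assumption, chosen only so the snippet runs without extra libraries). Note that Python's default for str patterns is the opposite of PCRE2's: \d and \w are Unicode-aware unless re.ASCII is passed. So re.ASCII stands in for PCRE2's default behaviour here, and the flagless call stands in for PCRE2_UCP. The helper name matches_both_ways is made up for illustration.

```python
import re

def matches_both_ways(escape: str, ch: str) -> tuple:
    """Return (ascii_only_result, unicode_aware_result) for one character."""
    # re.ASCII restricts \d, \s, \w to ASCII, like PCRE2 without PCRE2_UCP.
    ascii_only = re.fullmatch(escape, ch, re.ASCII) is not None
    # Without re.ASCII, Python consults Unicode properties, like PCRE2_UCP.
    unicode_aware = re.fullmatch(escape, ch) is not None
    return ascii_only, unicode_aware

print(matches_both_ways(r"\d", "7"))       # (True, True): ASCII digit
print(matches_both_ways(r"\d", "\u0663"))  # (False, True): ARABIC-INDIC DIGIT THREE
print(matches_both_ways(r"\w", "\u00e9"))  # (False, True): LATIN SMALL LETTER E WITH ACUTE
```

This only shows the semantic difference; it says nothing about the relative speed of the two modes in PCRE2.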

Note that if you really want to take the performance hit and use the Unicode tables, you can simply add (*UCP) at the beginning of the pattern. If you only want to use the Unicode tables in certain parts of the regex, you can use the following equivalences:

\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L} or \p{N}, plus underscore

i.e. instead of \d, explicitly write \p{Nd}; instead of \s, [\pZ\h\v]; and instead of \w, [\pL\pN_].
If that is deemed too inconvenient, we could simply preprocess the pattern and make replacements such as the following (or whatever escape sequences would be easy to remember and use):

  • => \p{Nd} (n for numeric, tilde to indicate Unicode)
  • => \P{Nd}
  • => [\pZ\h\v] (whitespace)
  • => [^\pZ\h\v]
  • => [\pL\pN_] (i for identifier, accent for Unicode)
  • => [^\pL\pN_]
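A rough sketch of such a preprocessing step is below. The shorthand names on the left-hand side (\~d, \~D, \~s, \~S, \~w, \~W) are invented placeholders for this example, not the escapes proposed above, and the function name expand_shorthand is likewise hypothetical. The code only rewrites the pattern string; the result would then be handed to PCRE2.

```python
import re

# Hypothetical shorthand escapes mapped to PCRE2 Unicode-property classes.
REPLACEMENTS = {
    r"\~d": r"\p{Nd}",
    r"\~D": r"\P{Nd}",
    r"\~s": r"[\pZ\h\v]",
    r"\~S": r"[^\pZ\h\v]",
    r"\~w": r"[\pL\pN_]",
    r"\~W": r"[^\pL\pN_]",
}

# Match an escaped backslash first, so that the text "\\~d" (a literal
# backslash followed by "~d") is never mistaken for the shorthand "\~d".
_TOKEN = re.compile(r"\\\\|\\~[dDsSwW]")

def expand_shorthand(pattern: str) -> str:
    """Replace each shorthand escape with its property class."""
    return _TOKEN.sub(
        lambda m: REPLACEMENTS.get(m.group(0), m.group(0)), pattern
    )

print(expand_shorthand(r"\~w+@\~d{3}"))  # [\pL\pN_]+@\p{Nd}{3}
```

One known limitation of this sketch: a shorthand used inside an existing character class, such as [\~s,], would expand to a nested bracket expression, so a real implementation would need to track whether it is inside a class.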

The problem with making it the default is that there is no easy way to disable it if you want the best performance (and you are only concerned with matching ASCII characters anyway).

I made this more convenient in StrRegex.jl by adding a u option (in addition to the i, m, s, and x options already available).


#2

AFAICT we just need to pass PCRE2_NEVER_UCP, which can be exposed to the user via a flag.


#3

IIUC, that just disallows setting UCP in the regex pattern via (*UCP), and doesn’t disable UCP if it’s been set in the options.


#4

Actually we can just avoid passing PCRE2_UCP when we get the flag, so I don’t see the problem.
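As a sketch of what that could look like when building the compile options: the flag letters u and a, the function name compile_options, and the ucp_default parameter are all assumptions for illustration. The bit values are the ones I recall from pcre2.h, so verify them against your own header before relying on them.

```python
# Bit values as recalled from pcre2.h (an assumption; check your header).
PCRE2_CASELESS = 0x00000008
PCRE2_DOTALL = 0x00000020
PCRE2_EXTENDED = 0x00000080
PCRE2_MULTILINE = 0x00000400
PCRE2_UCP = 0x00020000

FLAG_BITS = {
    "i": PCRE2_CASELESS,
    "m": PCRE2_MULTILINE,
    "s": PCRE2_DOTALL,
    "x": PCRE2_EXTENDED,
}

def compile_options(flags: str, ucp_default: bool) -> int:
    """Translate regex flag letters into a PCRE2 compile-options bitmask."""
    opts = PCRE2_UCP if ucp_default else 0
    for f in flags:
        if f == "u":        # hypothetical opt-in: use the Unicode tables
            opts |= PCRE2_UCP
        elif f == "a":      # hypothetical opt-out: simply do not pass PCRE2_UCP
            opts &= ~PCRE2_UCP
        elif f in FLAG_BITS:
            opts |= FLAG_BITS[f]
        else:
            raise ValueError(f"unknown regex flag: {f!r}")
    return opts

print(hex(compile_options("i", ucp_default=True)))   # 0x20008
print(hex(compile_options("ia", ucp_default=True)))  # 0x8
```

Either default can be expressed this way; the only difference between the opt-in and opt-out designs is the value of ucp_default.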


#5

I think it boils down to:

  1. Should the much faster case be the default, given that much of the usage of regex strings will be in cases where you don’t need the Unicode tables?
  2. Should the use of the tables and the associated performance degradation be opt-in, as in PCRE2 (by simply adding a ‘u’ to the regex options, as I’ve already done for StrRegex), or opt-out (as you have suggested)?
  3. Does Julia want to have its PCRE2 binding differ from the defaults of the well-documented and heavily used PCRE2 library itself?

In the last few years, has anybody complained about Julia not using the Unicode tables by default for \w, \s, and \d?


#6

Also, a major point is that changing the default for Julia regexes would be a breaking change, one that could not easily be handled via a deprecation warning.


#7

I’d rather be Unicode-aware (and therefore more correct in most situations) by default, and faster with a flag. This is similar to what we do with e.g. @inbounds or @simd. The survey of other languages made in the issue has shown that there is no standard behavior across languages and regex libraries.


#8

Then, if it’s true that it would be a breaking change, don’t do it.


#9

Python, Rust, and Go all have regex support in separate packages/crates/classes, not baked into the base language (which is what I believe should be done for Julia before v1.0 is released).

For those, the defaults can be different depending on which library is being used for the regex support.

Programs using the PCRE2 library directly expect the default to be ASCII for those 4 characters (and allow setting the PCRE2_UCP flag, or having (*UCP) at the beginning of a pattern).
For programs using the ICU library directly, word boundaries are ASCII by default; a flag can be set to enable slower, more complex Unicode-aware word boundaries.
Swift uses the NSRegularExpression class (which uses ICU internally); word boundaries are ASCII by default.
Java is like PCRE2 (and Julia up till now): ASCII by default, but with a regex compilation flag to make those 4 (and the POSIX character classes) Unicode-aware.


#10

AFAICT in ICU \w matches any Unicode word character, doesn’t it?


#11

Ah, my mistake!
I was thinking of the w option, in ICU, i.e.

Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.

I should have gone back and RTFM to refresh my memory on that case! :joy:


#12

I don’t know; it seems pretty weird to consider non-ASCII characters non-{word, digit} by default. Why not support all human languages in Unicode by default? Code suddenly failing e.g. when one day a filename is not in English doesn’t seem like a good default. We can certainly add a flag to enable ASCII mode if desired though.


#13

Well, as an example, non-ASCII digits are problematic: usually people (even in other languages) don’t really want anything but 0-9, and they would be taking the performance hit by default as well.
It might be good to have an option to toggle that separately.

Maybe having both a and u compile option flags, along with defaults based on the string type, would be OK.
So ASCIIStr, Text1Str, Text2Str, and Text4Str would not enable PCRE2_UCP by default (but could enable it with u), and String, LatinStr, UTF8Str, UCS2Str, etc. would (but could disable it with a).

Does that seem like a reasonable approach?


#14

True; although nobody has complained yet about that happening in Julia, and the regexes in base use \w, it should be fixed.


#15

Yes that seems reasonable. But is it possible/efficient to specify the option per string to match, and not just at regex compile time?


#16

True, but this is also a case in point — people expect \w to match word characters, not just English word characters!


#17

It’s not something that really depends on the string being matched (except that the string type can determine the default value to use).

You have to do tricks like I did in StrRegex; otherwise you’d be constantly recompiling.
It can be done, though: it would need to compile the pattern both ways (probably lazily for the non-default cases).

I’ll make sure this is handled efficiently in StrRegex.

Basically, if you have a UTF8Str or UniStr, the default would be to have u set for the pattern, but not for ASCIIStr (similar to the PCRE2_UTF flag). At run-time, you pick the compiled regex to use based on the type of the string being matched and the options given for the pattern; if that field is C_NULL, you compile the regex with those options and the correct code unit size, and then call the appropriate match code.
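A minimal sketch of that lazy, per-variant compilation is below. It is not the StrRegex implementation: it uses Python's re module, treats re.ASCII as a stand-in for "PCRE2_UCP not set", and uses str.isascii() as a stand-in for dispatching on the string type (ASCIIStr versus UTF8Str). The class name LazyRegex and its fields are hypothetical.

```python
import re

class LazyRegex:
    """Holds one pattern and compiles each variant only on first use."""

    def __init__(self, pattern: str, flags: str = ""):
        self.pattern = pattern
        self.flags = flags  # 'u' forces Unicode tables, 'a' forces ASCII
        # One slot per variant; None plays the role of C_NULL.
        self._compiled = {True: None, False: None}
        self.compile_count = 0

    def _unicode_mode(self, text: str) -> bool:
        if "u" in self.flags:
            return True
        if "a" in self.flags:
            return False
        # Default chosen from the subject string (stand-in for its type).
        return not text.isascii()

    def _get(self, unicode_mode: bool):
        compiled = self._compiled[unicode_mode]
        if compiled is None:
            compiled = re.compile(self.pattern, 0 if unicode_mode else re.ASCII)
            self._compiled[unicode_mode] = compiled
            self.compile_count += 1
        return compiled

    def search(self, text: str):
        return self._get(self._unicode_mode(text)).search(text)

rx = LazyRegex(r"\w+")
rx.search("abc")
rx.search("def")
print(rx.compile_count)  # 1: only the ASCII variant has been compiled so far
```

The point of the design is that matching many strings of the same kind never triggers a recompile, and the non-default variant costs nothing until it is first needed.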