Flaw in Regex support for String


I believe there is a flaw in the Regex support in Base.
Since #24999, a String does not necessarily contain valid UTF-8.
However, regex.jl has the following defaults:
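For reference, the defaults in question looked roughly like this at the time; this is a reconstruction from base/regex.jl, and the exact names and flag values may differ between Julia versions:

```julia
const PCRE = Base.PCRE  # Julia's internal wrapper around libpcre2

# The pattern is compiled in UTF mode, and validity checking is skipped
# for BOTH the pattern (compile time) and the subject (match time):
const DEFAULT_COMPILER_OPTS = PCRE.UTF | PCRE.NO_UTF_CHECK | PCRE.ALT_BSUX
const DEFAULT_MATCH_OPTS = PCRE.NO_UTF_CHECK
```

So by default an unvalidated subject string is handed to PCRE with the very flag the manual warns about.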


Per the PCRE manual (https://www.pcre.org/current/doc/html/pcre2unicode.html):

In some situations, you may already know that your strings are valid, and therefore want to skip these checks in order to improve performance, for example in the case of a long subject string that is being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time, PCRE2 assumes that the pattern or subject it is given (respectively) contains only valid UTF code unit sequences.

Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check for the pattern; it does not also apply to subject strings. If you want to disable the check for a subject string you must pass this option to pcre2_match() or pcre2_dfa_match().

If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result is undefined and your program may crash or loop indefinitely.

Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is given if an escape sequence for an invalid Unicode code point is encountered in the pattern. If you want to allow escape sequences such as \x{d800} (a surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible only in UTF-8 and UTF-32 modes, because these values are not representable in UTF-16.


Good catch. I have made a pull request to fix that.


In the PR, I saw the comment:

match(r"a", "\U11000") is supposed to fail according to the docs (look for PCRE2_ERROR_UTF8_ERR14), but it doesn’t. Am I missing something?

It is simply missing another 0 (the sort of typo we all make!)

julia> '\U11000'
'𑀀': Unicode U+011000 (category Mc: Mark, spacing combining)

julia> '\U110000'
ERROR: syntax: invalid escape sequence

julia> match(r"a", "\U110000")
ERROR: syntax: invalid escape sequence


Thanks, that was a typo in the PCRE docs. I’ve updated the PR.


In your comment on the PR, you stated:
it’s rather that we would like to allow working with invalid strings without ever throwing an error, and unfortunately PCRE doesn’t seem to share that objective at all
I would say instead that the PCRE developers have followed the recommendations of the Unicode organization, as Julia really should also (there are very good security reasons, as well as performance ones, for doing so).

Given the lack of interest that I’ve seen in general about strings in the Julia community, I also can’t see the benefit of rewriting a very good regex package in Julia (at least, not in the near future).

Also, there are many places in the PCRE code where setting the NO_UTF_CHECK flag and passing invalid data could cause undefined behavior, for example where the code uses values above 0x10ffff as markers. (I checked: the code they use to pick up UTF-8 values actually accepts sequences of up to 6 bytes, i.e. the original UTF-8 spec, so 32-bit values could incorrectly be confused with a marker such as their NOTACHAR.)
Another issue is reading past the end of tables, causing an access violation.


Would you have examples where invalid UTF-8 causes crashes or at least incorrect behavior? That would be useful.

Please stop this kind of comment, Scott. That’s really not helpful, and given the time people like Stefan have spent on the design of String it’s clearly inaccurate. I’d rather stop this conversation and close my PR if you don’t want to pursue it in a constructive way.


I already explained where it would give bad behavior; I’m sorry you didn’t feel that was helpful.

Here is a simple example of getting an unexpected answer:

julia> findnext(r"\xfd\xbf\xbf\xbf\xbf\xbf", String(b"abc \xfd\xbf\xbf\xbf\xbf\xbf 123"), 1)

julia> findnext("\xfd\xbf\xbf\xbf\xbf\xbf", String(b"abc \xfd\xbf\xbf\xbf\xbf\xbf 123"), 1)

julia> findnext("123", String(b"abc \xfd\xbf\xbf\xbf\xbf\xbf 123"), 1)

On the other hand, the Strs package doesn’t have this bug:

julia> using Strs

julia> r = Regex(Text1Str("\xfd\xbf\xbf\xbf\xbf\xbf"))

julia> a = Text1Str(b"abc \xfd\xbf\xbf\xbf\xbf\xbf 123")
"abc ý¿¿¿¿¿ 123"

julia> fnd(First, r, a)


I didn’t say reporting problems was unhelpful; I was referring to complaints about the lack of attention given to strings, which you keep repeating while I was precisely trying to fix the bug you reported in that thread.

Thanks for the example. On my branch which enables PCRE UTF-8 checks, it throws an error. So I’m curious how it gives the correct result with Strs. Didn’t you suggest it should be an error?


I think you misinterpreted what I said.

I said there was a lack of interest (when I ask about strings on Gitter, mostly I get responses of the sort “that sounds nice, but I don’t do any processing on strings”) and said nothing at all about the attention given.


If I had used the UTF8Str type, it would give an error (before ever getting into the PCRE code), as it should.
In that example, I used the Text1Str type, which is for unvalidated strings (with 1-byte code units), to show how PCRE can be used to correctly search for sequences in non-UTF strings, if desired.


OK. But then I guess one cannot use the full power of regexes (e.g. with Unicode character categories) when telling PCRE that the string isn’t valid Unicode?


You are simply telling PCRE that the string has not been validated, and that PCRE must perform its own validation checks on the input string, to avoid undefined behavior or possibly access violations (for example, from indexing the character-category tables with something outside the valid ranges 0-0xd7ff and 0xe000-0x10ffff). The flag does not say that the string isn’t valid Unicode, only that it might not be.

What is the real use case for doing Regex searches on a string with invalid sequences?
Doing so just seems to perpetuate “garbage in, garbage out” (besides allowing the security holes that the Unicode org warns about).

I see two ways that this can be handled well (and following the Unicode org recommendations):

  1. If you are looking for valid sequences, and wish to use Unicode character categories, convert the string to valid Unicode first (i.e. to UTF8Str, replacing invalid sequences with '\ufffd').
  2. If you want to match the invalid sequences, then take the string as a Text1Str, and match the invalid byte sequence directly, as I showed above.
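As a rough sketch of the first approach (the helper name sanitize is hypothetical, and this assumes the post-#24999 behavior where iterating an invalid String yields malformed Chars):

```julia
# Hypothetical helper: replace each invalid character with U+FFFD so the
# result is valid UTF-8 before handing it to a UTF-aware regex engine.
function sanitize(s::AbstractString)
    io = IOBuffer()
    for c in s  # malformed sequences come out as invalid Chars
        print(io, isvalid(c) ? c : '\ufffd')
    end
    return String(take!(io))
end

sanitize(String([0x61, 0xff, 0x62]))  # "a\ufffdb"
```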


It’s more than that AFAICT: it also implies that a character is always a single code unit, so that e.g. . will match a byte, not a codepoint. So in practice it only works if you specify bytes and expect byte results (for example when matching ASCII content).

Yes, that’s the whole point: avoid throwing errors if the string contains invalid Unicode. If people want to validate their strings, it’s easy to do, and having a validated UTF-8 string type makes perfect sense. But sometimes one needs to preserve the exact input (e.g. file names, or the contents of a file). Granted, it matters less for regex matching, and it looks like it’s much harder to ensure, but that doesn’t mean it makes no sense either.


You are confusing the two flags, UTF and NO_UTF_CHECK.
UTF must be specified as part of the compilation options in order to do UTF-8 matching (or UTF-16 with PCRE built for 16-bit code units).
NO_UTF_CHECK just means that the input (either for compilation or for matching) has already been checked for validity, which speeds things up by avoiding those checks on every search.
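A minimal sketch of the distinction, using the constant names from Julia’s internal Base.PCRE wrapper (an internal module, so this is illustrative rather than a supported API):

```julia
const PCRE = Base.PCRE  # internal wrapper around libpcre2

# UTF is a compile-time option: interpret the pattern and subjects as
# UTF code units rather than as individual bytes.
compile_opts = PCRE.UTF

# NO_UTF_CHECK is orthogonal: it promises PCRE that the input was
# *already* validated, letting it skip its own per-call validity checks.
safe_match_opts = UInt32(0)          # PCRE validates each subject itself
fast_match_opts = PCRE.NO_UTF_CHECK  # only safe for pre-validated subjects
```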

Again, why would you want a “garbage in, garbage out” situation, and allow known security holes?

If you want to preserve exact input, then Strs.jl provides a simple means of doing so without giving up the safety and performance advantages of using only validated strings when you expect UTF-8 (or UTF-16 or UTF-32, for that matter).
Simply use unsafe_str, and you’ll get back either a validated string or a Text1Str (or Text2Str/Text4Str, if your input was a vector of UInt16 or UInt32); or, if you know in advance that you want to preserve the input exactly, simply use Text1Str (or BinaryStr for binary data) to begin with.


Yes, I was speaking about UTF. Which flag does Strs.jl set to get the above result? That can’t be NO_UTF_CHECK since Base also sets it and the result is different.


Text1Str, which does not assume any encoding, does not set the UTF flag, although you can have the pattern treated as UTF by specifying (*UTF) at the beginning of the regex pattern.
For example:

julia> y = Text1Str("this is a test of \uff \U1f596")
"this is a test of ÿ ð\u9f\u96\u96"

julia> fnd(First, Regex(Text1Str("(*UTF)\U1f596")), y)

My regex handling doesn’t work for all cases yet; I still need to figure out how to build the PCRE2 library for 16 and 32 bits with the new BinaryProvider/BinaryBuilder packages to finish that part of my project.


I just noticed another serious issue with the current Regex support, which will either need to be fixed or at the very least well documented: it is not thread safe.


Have you found examples of incorrect behavior when the regex is valid UTF-8, but the string is not? That would be interesting, since supporting invalid regexes does not sound as useful as supporting invalid strings.


I’m not sure I understand your question: the regex must be valid UTF (8 or 16) if you have the UTF compile flag set (which also means that the haystack is considered to be UTF, and will be checked for validity unless you set the NO_UTF_CHECK flag for the match).
Are you aware of other libraries that don’t follow the Unicode org recommendations, that allow processing of invalid strings in the way you want? Note:
Boost security notice from 2013
Rust handling of invalid UTF-8
IETF RFC3629 security recommendations
(and numerous others)

I think the problems with doing so are fairly well-known among people who do a lot of work with string processing / conversions / encoding.


Do you mean the behavior documented in https://docs.julialang.org/en/latest/manual/parallel-computing/#The-@threads-Macro-1 or something else?