I believe there is a flaw with the Regex support in Base.
Since #24999, a String does not necessarily contain valid UTF-8.
However, regex.jl has the following defaults:
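The defaults in question, reconstructed from the surrounding discussion (treat this as a sketch rather than a verbatim quote of base/regex.jl; the exact constants vary between Julia versions):

```julia
# Base sets NO_UTF_CHECK both at compile time and at match time:
const DEFAULT_COMPILER_OPTS = PCRE.UTF | PCRE.NO_UTF_CHECK | PCRE.ALT_BSUX
const DEFAULT_MATCH_OPTS = PCRE.NO_UTF_CHECK
```

The PCRE2 documentation warns about these options: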
> In some situations, you may already know that your strings are valid, and therefore want to skip these checks in order to improve performance, for example in the case of a long subject string that is being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time, PCRE2 assumes that the pattern or subject it is given (respectively) contains only valid UTF code unit sequences.
>
> Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check for the pattern; it does not also apply to subject strings. If you want to disable the check for a subject string you must pass this option to pcre2_match() or pcre2_dfa_match().
>
> If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result is undefined and your program may crash or loop indefinitely.
>
> Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is given if an escape sequence for an invalid Unicode code point is encountered in the pattern. If you want to allow escape sequences such as \x{d800} (a surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible only in UTF-8 and UTF-32 modes, because these values are not representable in UTF-16.
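A minimal sketch of the hazard in Julia terms (the calls are real Base APIs; the final line's behavior is the undefined-behavior scenario described in the quoted documentation, not a guaranteed outcome):

```julia
# Since #24999, a String may hold arbitrary bytes:
s = String([0x66, 0x6f, 0x6f, 0xff])  # trailing 0xff can never appear in valid UTF-8
isvalid(s)                            # false

# With NO_UTF_CHECK set at match time, PCRE2 receives this subject
# unvalidated; per the documentation above, the result is undefined:
match(r"foo", s)
```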
Given the lack of interest that I've seen in general about strings in the Julia community, I also can't see the benefit of rewriting a very good regex package in Julia (at least, not in the near future).
Also, there are many places in the PCRE code where setting the NO_UTF_CHECK flag and passing invalid data could cause undefined behavior, for example in places where the code uses values above 0x10ffff as a marker. (I checked: the code they use to pick up UTF-8 values actually handles up to 6 bytes, i.e. the original UTF-8 spec, so 32-bit values could incorrectly be confused with a marker such as their NOTACHAR.)
Another issue may be accessing past the end of tables, and causing an access violation.
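As a hedged illustration of the marker collision: lead bytes 0xF8-0xFD start 5- and 6-byte sequences under the original (pre-RFC 3629) UTF-8 definition, so a decoder that accepts them can produce 32-bit values above 0x10ffff:

```julia
# Sequence length implied by a lead byte under the original UTF-8 spec
# (a sketch for lead bytes only, not PCRE's actual code):
nbytes_old(b::UInt8) = b < 0x80 ? 1 : leading_ones(b)

nbytes_old(0xfd)  # 6: a 6-byte sequence decodes to a value that can collide
                  # with an internal 32-bit sentinel such as NOTACHAR
```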
Would you have examples where invalid UTF-8 causes crashes or at least incorrect behavior? That would be useful.
Please stop making this kind of comment, Scott. That's really not helpful, and given the time people like Stefan have spent on the design of String it's clearly inaccurate. I'd rather stop that conversation and close my PR if you don't want to pursue it in a constructive way.
I didn't say reporting problems was unhelpful; I was referring to complaints about the lack of attention given to strings, which you keep repeating while I was precisely trying to fix the bug you reported in that thread.
Thanks for the example. On my branch which enables PCRE UTF-8 checks, it throws an error. So I'm curious how it gives the correct result with Strs. Didn't you suggest it should be an error?
I said there was a lack of interest (when I ask about strings on Gitter, mostly I get responses of the sort "that sounds nice, but I don't do any processing on strings") and said nothing at all about the attention given.
If I had used the UTF8Str type, it would give an error (before ever getting into the PCRE code), as it should.
In that example, I used the Text1Str type, which is for unvalidated strings (with 1-byte code units), to show how PCRE can be used to correctly search for sequences in non-UTF strings, if desired.
OK. But then I guess one cannot use the full power of regexes (e.g. with Unicode character categories) when telling PCRE that the string isn't valid Unicode?
You are simply telling PCRE that the string has not been validated, and that PCRE must perform its own validation checks on the input string, to avoid undefined behavior or possible access violations (for example, trying to access the tables for those character categories with something that is outside the valid ranges 0-0xd7ff, 0xe000-0x10ffff). The flag does not say that the string isn't valid Unicode, only that its validity has not been checked.
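In Julia terms, the range check being described is the standard Unicode scalar-value test (a sketch mirroring what Base's isvalid(Char, x) already provides):

```julia
# Valid Unicode scalar values: 0x0000-0xD7FF and 0xE000-0x10FFFF
is_scalar(c::UInt32) = c <= 0xd7ff || 0xe000 <= c <= 0x10ffff

is_scalar(UInt32(0xd800))    # false: surrogate code point
is_scalar(UInt32(0x110000))  # false: above the Unicode range
```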
What is the real use case for doing Regex searches on a string with invalid sequences?
Doing so just seems to perpetuate "garbage in, garbage out" (besides allowing the security holes that the Unicode org warns about).
I see two ways that this can be handled well (following the Unicode org recommendations):

1. If you are looking for valid sequences, and wish to use Unicode character categories, convert the string to valid Unicode first (i.e. to UTF8Str, replacing invalid sequences with '\ufffd'); see the sketch after this list.
2. If you want to match the invalid sequences, then take the string as a Text1Str, and match the invalid byte sequence directly, as I showed above.
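A sketch of option 1 using only Base (UTF8Str comes from Strs.jl; scrub here is a hypothetical helper that just shows the replacement idea):

```julia
# Replace each invalid sequence with U+FFFD before regex matching:
function scrub(s::AbstractString)
    sprint() do io
        for c in s  # iterating an invalid String yields malformed Chars
            print(io, isvalid(c) ? c : '\ufffd')
        end
    end
end

match(r"\w+", scrub(String([0x61, 0xff, 0x62])))  # matches "a"
```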
It's more than that AFAICT: it also implies that a character is always a single code unit, so that e.g. `.` will match a byte, not a codepoint. So in practice it only works if you specify bytes and expect byte results (for example when matching ASCII content).
Yes, that's the whole point: avoid throwing errors if the string contains invalid Unicode. If people want to validate their strings, it's easy to do, and having a validated UTF-8 string type makes perfect sense. But sometimes one needs to preserve the exact input (e.g. file names, contents of a file). Granted, it matters less for regex matching and it looks like it's much harder to ensure, but that doesn't mean it makes no sense either.
You are confusing the two flags, UTF and NO_UTF_CHECK. UTF must be specified as part of the compilation options in order to do UTF-8 matching (or UTF-16 with PCRE built for 16-bit). NO_UTF_CHECK just means that the input (either for compilation or for matching) has already been checked for validity (which speeds things up by avoiding those checks on every search).
Again, why would you want a "garbage in, garbage out" situation, and allow known security holes?
If you want to preserve exact input, then in Strs.jl I have provided a simple means of doing so, without giving up the safety and performance advantages of using only validated strings when you expect UTF-8 (or UTF-16 or UTF-32, for that matter).
Simply use unsafe_str, and you'll get back either a validated string, or a Text1Str (or Text2Str/Text4Str, if your input was a vector of UInt16 or UInt32); or, if you know in advance you want to preserve it exactly, simply use Text1Str (or BinaryStr for binary data) to begin with.
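A hedged sketch of that workflow, assuming unsafe_str behaves as described above (the exact return types are not verified here):

```julia
using Strs

s = unsafe_str([0x66, 0x6f, 0x6f])        # valid UTF-8: expect a validated string type
t = unsafe_str([0x66, 0x6f, 0x6f, 0xff])  # invalid (trailing 0xff): expect a Text1Str
```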
Yes, I was speaking about UTF. Which flag does Strs.jl set to get the above result? That can't be NO_UTF_CHECK since Base also sets it and the result is different.
Text1Str, which does not assume any encoding, does not set the UTF flag, although you can still get UTF matching if you want it, by specifying (*UTF) at the beginning of the regex pattern.
For example:

```julia
julia> y = Text1Str("this is a test of \uff \U1f596")
"this is a test of Ã¿ ð\u9f\u96\u96"

julia> fnd(First, Regex(Text1Str("(*UTF)\U1f596")), y)
22:25
```
My regex handling doesn't work for all cases yet; I still need to figure out how to build the PCRE2 library for 16- and 32-bit code units with the new BinaryProvider/BinaryBuilder packages, to finish that part of my project up.
I just noticed another serious issue with the current Regex support, one that will either need to be fixed or at the very least well documented: it is not thread-safe.
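A hedged sketch of the problem (hypothetical demonstration; the relevant detail is that a Regex object caches a single PCRE match-data buffer that match reuses across calls):

```julia
re = r"a+"
results = Vector{Any}(undef, 100)
Threads.@threads for i in 1:100
    # Concurrent calls share the mutable match-data state stored in `re`,
    # so results can be corrupted or the program can crash:
    results[i] = match(re, "aaa")
end
```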
Have you found examples of incorrect behavior when the regex is valid UTF-8, but the string is not? That would be interesting, since supporting invalid regexes does not sound as useful as supporting invalid strings.
I'm not sure I understand your question: the regex must be valid UTF (8 or 16) if you have the UTF compile flag set (which also means that the haystack is considered to be UTF, and will be checked for validity unless you set the NO_UTF_CHECK flag for the match).
Are you aware of other libraries that don't follow the Unicode org recommendations, that allow processing of invalid strings in the way you want? Note:
- the Boost security notice from 2013
- Rust's handling of invalid UTF-8
- the IETF RFC 3629 security recommendations
(and numerous others)
I think the problems with doing so are fairly well-known among people who do a lot of work with string processing / conversions / encoding.