Flaw in Regex support for String

No. It’s a much more serious problem (the issue the documentation you pointed out just talks about Regexes being mutated, but that can be worked around by creating the Regex objects directly and not using the r"…" macro).

What I noticed was the following (in pcre.jl):

const JIT_STACK = RefValue{Ptr{Cvoid}}(C_NULL)
const MATCH_CONTEXT = RefValue{Ptr{Cvoid}}(C_NULL)

Those would need to be allocated on a per-thread basis, or use a lock to control access, and can’t easily be worked around at the moment.

1 Like

A normal approach would be to put them in an array which should be relatively simple as when __init__ in PCRE is called we know the number of threads and this number cannot change later.

Yes, exactly. I’m not saying that it wouldn’t be simple, just that it’s another gotcha if you try to use threads at the moment.
Julia really needs a simple way (as other people such as Steven Johnson have asked for) to create something as thread-local storage, and make sure that accessing those variables is fast.

I meant when passing NO_CHECK_UTF for the match but not for the regex compilation (or for both since it doesn’t make a difference if the regex is valid).

AFAICT these links refer to issues where invalid UTF-8 strings are incorrectly considered as valid, which is clearly problematic. In Julia String provides no guarantees that its contents are valid UTF-8, so people should call isvalid or use a validated string type if they rely on it for safety or correctness.

There are also other issues listed in the Unicode Technical Report on security considerations.

That would be bad for performance (I’ve shown that with my Strs.jl package - it’s faster to do validity checking upfront and then be able to use optimizations not possible if you don’t know the strings are valid).
It’s also like saying that you should put a checkbounds call before every array access, if you care about safety or correctness, instead of the approach of allowing you to use @inbounds when you are sure that the access is valid (which is the same philosophy that the PCRE library takes with providing the NO_CHECK_UTF flag).

That seems to be rather a double standard - Julia checks validity for array access (unless you are sure and specifically turn it off), but requires you to check at run-time for string validity if you care about safety and correctness.

Besides the well-known security issues with allowing invalid sequences in UTF-8, UTF-16, or UTF-32 encodings, there’s a big performance hit by not checking validity up front, both because the validity check can be a lot faster,
(even without SIMD instructions, but with SIMD instructions, which PCRE’s validity checks don’t use, it can be quite a bit faster), and then further, because the processing is a lot simpler if you know you can’t have short sequences, and just looking at the first byte (or word for UTF-16), you know you can access the next byte(s) or word.

The PCRE2 code will definitely read up to 5 bytes past the end of the string if you input an invalid UTF-8 sequence such as \xfd at the end.

I frankly don’t think that the PCRE2 maintainers would be interested in trying to go against the IETF, W3C, and Unicode org recommendations to allow for invalid sequences, as has been suggested on GitHub.

Here are some other examples of bad behavior with the current incorrect setting NO_UTF_CHECK for compile and match:

julia> findfirst(r"\x80", "aas;ldfjasdlkffoo abc\xc2\x80")
22:22

julia> findfirst(r"\xc2", "aas;ldfjasdlkffoo abc\xc2\x80")

julia> findfirst(r"\C\xbf", "aas;ldfjasdlkffoo abc\xfd\xbf\xbf\xbf\xbf\xbf")
ERROR: LoadError: PCRE JIT error: no more memory
Stacktrace:
 [1] error at ./error.jl:33 [inlined]
 [2] jit_compile at ./pcre.jl:109 [inlined]
 [3] compile(::Regex) at ./regex.jl:56
 [4] Regex(::String, ::UInt32, ::UInt32) at ./regex.jl:30
 [5] Regex(::String) at ./regex.jl:51
 [6] @r_str(::LineNumberNode, ::Module, ::Any, ::Vararg{Any,N} where N) at ./regex.jl:83
in expression starting at REPL[14]:1

All of these examples involve invalid regexes.

And? So is the example in #26796, and you seemed to be calling for the ability to search for illegal sequences there.

Allowing invalid Unicode in substring search is very different than allowing invalid Unicode in regular expressions.

:+1: for checking pattern for validity on construction Regex in base (as I understand that this is the conclusion what should be done) and I guess packages can do whatever is sensible within their ecosystem. I will make a PR (https://github.com/JuliaLang/julia/pull/26802/files) so that it can be discussed there.

The example I gave with the overlong sequence at the end (or near enough to the end) happens no matter whether the regex is valid or not, and can still cause an access violation.

It’s not always faster. That is if you get away with no validity checking. Say if I implemented grep, or wc or line count.

To keep this on topic, I read Perl had the best UTF-8 (may have said Unicode) support, and when I checked it read in illegal bytes. I.e. didn’t check for BOM and do anything with it, just read as illegal UTF-8 bytes. So is it likely PCRE (made for/used by Perl? Or only compatible with?) allows illegal? Isn’t Perl as lax as Julia?

On arrays, yes sometimes you access them only just once more or never… but Julia is made for heavy repeated access to them. The bounds checks (without annotations) can often be optimized by the compiler outside of the loop, so I see not “double standard”. For heavy text processing on same string, it’s like the case when checking can’t be optimized away. Then you need to annotate. Julia could cache in a “for_sure__legal_UTF-8” bit kept with strings, so checking really only happens once.

Besides performance issues, there are a lot of well known security issues with allowing invalid UTF-8 sequences.

Security issues, because of the existence of an unchecked string type? Or because of using it in an security sensitive contexts? The latter would be much less of an argument against julias current behaviour - especially since no one is against having a package like CheckedAndSecureStrings.jl :wink:

1 Like

When you can have strings that are faster, generally take less space, are easier to use, and eliminate some of the known security issues (there are of course others that you really need to program defensively for, see the Unicode security recommendations), why would you even want to use Base String and Char? :wink:

Edit: These days, how can you know when some function will or will not be used in a security sensitive context?
This came up when discussing the hashing function used for strings in Julia (MurmurHash3), people said that it was important to make the Dict type safer against DOS attacks. Julia is going to be used for the new airplane navigation / collision avoidance stuff, right? Seems to me that being more security conscious would be a good thing.

Please bring up actual security issues as you find them.

I’ve pointed these out more than once: here are some well known ones (again!):
UTF-8 Exploits

None of those are actual exploitable issues in Julia.

Of course they are. People have written web applications in Julia, so any of the exploits that take advantage of invalid strings. For example:

Process A performs security checks, but does not check for non-shortest forms.
Process B accepts the byte sequence from process A, and transforms it into UTF-16 while interpreting non-shortest forms.
The UTF-16 text may then contain characters that should have been filtered out by process A.

If Process A is written in Julia, and passes through bad byte sequences, but because the Julia comparison operators won’t be able to do the security checks correctly, if process B (or maybe on the same process, in the OS, a library call, a database connection, or something written in a JVM based language) would be vulnerable.

Note that there are outstanding PRs to try to fix a few of the comparison problems in Julia already.