Issues with Regex support in Julia (threading, handling of abstract strings)

The current regex support in Julia is not thread-safe, for a few reasons.
The first is that the underlying pcre.jl code creates a single MATCH_CONTEXT and a single JIT_STACK. According to the PCRE2 documentation, those need to be per-thread.
The other problem, which is somewhat harder to deal with, is that when a Regex object is created with the r"..." string macro, it looks like the same object will be shared among threads (I’m not yet 100% sure about that; I’m trying to write a test for it).
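
A minimal sketch of such a test, under the assumption that comparing object identities is enough to show sharing. The helper name `get_literal` and the pattern are made up for illustration, and Julia needs to be started with more than one thread (for example via `JULIA_NUM_THREADS`) for the loop to actually run in parallel:

```julia
using Base.Threads

# Hypothetical helper: returns whatever object the r"..." literal evaluates to.
get_literal() = r"a+b"

# Record the identity of the Regex object seen by each loop iteration.
ids = Vector{UInt}(undef, 8)
@threads for i in 1:8
    ids[i] = objectid(get_literal())
end

# If every iteration saw the same identity, the literal is one shared object.
shared = all(id -> id == ids[1], ids)
println("threads available: ", nthreads())
println("same Regex object seen everywhere: ", shared)
```

If this reports that the object is shared, any mutable match state stored inside the Regex would be visible to all threads at once, which is the scenario I’m worried about.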

I am implementing a new wrapper for PCRE in JuliaString, where I am trying to solve these two problems. It should also handle the 8-, 16-, and 32-bit PCRE libraries, which I want to have built with BinaryBuilder and then installed with BinaryProvider, to ensure an up-to-date PCRE library (the one in master is 10.30, the current one is 10.31). In addition, it should handle strings that are not UTF-encoded more efficiently (for example, ASCII or Latin1 strings, 16-bit Unicode strings without any surrogates, etc.).

I’d also like to recommend moving the regex support out of Julia Base and into stdlib, so that it will be easier to innovate, to update when PCRE or the Unicode standard is updated, and to support different string types.
There are only 36 uses of r"..." in Julia Base, in 11 files. They look like fairly simple patterns, which could be replaced by some handwritten code, so that Base would not depend on the PCRE library at all.
In the short term, dummy packages for PCRE and regex could be placed in stdlib, where any names could be exported, while leaving the PCRE and regex code in Base until Base is fully decoupled.
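
To illustrate what replacing a simple pattern with handwritten code could look like, here is a small sketch. The pattern and the function name `is_blank` are hypothetical and not taken from Base; the two versions agree on the ASCII inputs shown, but I have not checked that they agree on every Unicode whitespace character:

```julia
# Regex version: true when the string is empty or contains only whitespace.
is_blank_regex(s::AbstractString) = occursin(r"^\s*$", s)

# Handwritten version: no dependency on PCRE.
is_blank(s::AbstractString) = all(isspace, s)

for s in ("", "   ", " \t\n", "  x ", "abc")
    println(repr(s), " => ", is_blank(s), " (regex: ", is_blank_regex(s), ")")
end
```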