How to use Unicode characters in regex named groups?

Hello,

First, I should say that I am new to this language and am trying to find out whether it suits my needs and style.

I would like to know how I can use Unicode characters in regex named groups, because doing so results in an error, unlike in Python (my main programming language).
Is this the intended behavior (it seems inconsistent with the Unicode support in the rest of the language)? Is there a workaround (another regex engine, etc.)? Is there a good way to fix it (for example, translating the names before they reach the engine)?

Here is a minimal working example with the associated error:

modèle = r"(?P<données>.*)"

ERROR: LoadError: LoadError: PCRE compilation error: syntax error in subpattern name (missing terminator) at offset 8
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compile(::String, ::UInt32) at ./pcre.jl:103
 [3] compile(::Regex) at ./regex.jl:69
 [4] Regex(::String, ::UInt32, ::UInt32) at ./regex.jl:40
 [5] Regex(::String) at ./regex.jl:65
 [6] @r_str(::LineNumberNode, ::Module, ::Any, ::Vararg{Any,N} where N) at ./regex.jl:103
 [7] include at ./boot.jl:317 [inlined]
 [8] include_relative(::Module, ::String) at ./loading.jl:1041
 [9] include(::Module, ::String) at ./sysimg.jl:29
 [10] exec_options(::Base.JLOptions) at ./client.jl:229
 [11] start() at ./client.jl:421

It works like this, but this “workaround” should absolutely be avoided:

modèle = r"(?P<donnees>.*)"

Process finished with exit code 0
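
One possible way to "translate the names around the engine", as asked above, is to leave the groups unnamed and keep the Unicode names in a lookup table on the Julia side. This is only a minimal sketch: the helper `named_capture`, the pattern, and the name table `noms` are hypothetical and chosen purely for illustration.

```julia
# Pattern with plain numbered groups only, so PCRE never sees a Unicode name.
modèle = r"(\d+)-(\w+)"

# Assumed lookup table: Unicode name => index of the numbered group.
noms = Dict("année" => 1, "données" => 2)

# Hypothetical helper: return the capture registered under `nom`,
# or `nothing` when the match itself failed.
function named_capture(m::Union{RegexMatch,Nothing}, table::Dict{String,Int}, nom::AbstractString)
    m === nothing && return nothing
    return m.captures[table[nom]]
end

m = match(modèle, "2018-mesures")
println(named_capture(m, noms, "année"))
println(named_capture(m, noms, "données"))
```

The table has to be kept in sync with the order of the groups by hand, which is the main drawback compared with real named groups.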

Thank you.

Sincerely.

This sounds like a bug/limitation in the PCRE library to me. Its documentation only says that “Names consist of up to 32 alphanumeric characters and underscores”, without specifying whether only ASCII characters are accepted. You could file a bug there if you want this to be fixed.


Hello.

Shortly after your answer, I sent a message to Philip Hazel, who kindly replied, and a few days ago he committed a fix.

Names will match this pattern in UTF mode: ^[_\p{L}][_\p{L}\p{Nd}]*\z

It should work in the next release (10.33), likely in a couple of months, he said.

Sincerely.
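
For anyone who wants to check candidate group names against the rule quoted above, here is a minimal sketch. The helper name `is_valid_group_name` is hypothetical, and the check only mirrors the quoted pattern; it does not ask the installed PCRE library what it actually accepts.

```julia
# Hypothetical helper: true if `name` satisfies the name rule quoted above,
# i.e. a letter or underscore followed by letters, decimal digits or underscores.
# Accented letters must be precomposed (NFC): a separate combining mark is
# category M, not L, so it would be rejected by this pattern.
function is_valid_group_name(name::AbstractString)
    return occursin(r"^[_\p{L}][_\p{L}\p{Nd}]*\z", name)
end

println(is_valid_group_name("donn\u00e9es"))   # letters only
println(is_valid_group_name("_total2"))        # underscore, letters, digit
println(is_valid_group_name("2donn\u00e9es"))  # starts with a digit
```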


That's great, thanks! Please file an issue against Julia once the new PCRE has been released so that we can move to it.