Did Julia 1.6 introduce a regression for regex properties?

D_A · March 27, 2021, 12:42pm

Hello.

It seems the latest Julia release (nice job, by the way!) introduced strange behaviour for Unicode properties in regexes, especially script properties…

text = "aa bb"
text |> collect
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

pattern = r"\p{Ll}+"
eachmatch(pattern, text) |> collect
# Good: 2-element Vector{RegexMatch}: RegexMatch("aa") RegexMatch("bb")

pattern = r"[\p{Ll}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("aa bb")

text = "壹貳 叁"
text |> collect
# '壹': Unicode U+58F9 (category Lo: Letter, other)
# '貳': Unicode U+8CB3 (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# '叁': Unicode U+53C1 (category Lo: Letter, other)

pattern = r"[\p{Han}]+"
eachmatch(pattern, text) |> collect
# Good: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")

pattern = r"[\p{Han}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")

pattern = r"[\p{Han} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")

text = "カ メ"
text |> collect
# 'カ': Unicode U+30AB (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'メ': Unicode U+30E1 (category Lo: Letter, other)

pattern = r"[\p{L}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")

pattern = r"[\p{Katakana} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")

pattern = r"[\p{Katakana}\s]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")

pattern = r"[\p{Katakana}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")

The Letter property works fine, but the script properties (Han, Katakana, etc.) fail when combined in a character class with a space property or class (\p{Zs} or \s), whereas combining them with a literal space character works…
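To compare versions quickly, the cases above can be collected into one small script. This is only a sketch: `match_strings` and `cases` are names I made up for illustration, and the patterns are the ones from this post. Run it under each Julia version and compare the printed lines.

```julia
# Hypothetical helper: return the matched substrings for a pattern applied to a text.
match_strings(pattern::Regex, text::AbstractString) =
    [String(m.match) for m in eachmatch(pattern, text)]

# (pattern, text) pairs taken from the examples in this post.
cases = [
    (r"[\p{Han}]+",             "壹貳 叁"),
    (r"[\p{Han}\p{Zs}]+",       "壹貳 叁"),
    (r"[\p{Han} ]+",            "壹貳 叁"),
    (r"[\p{L}\p{Zs}]+",         "カ メ"),
    (r"[\p{Katakana} ]+",       "カ メ"),
    (r"[\p{Katakana}\s]+",      "カ メ"),
    (r"[\p{Katakana}\p{Zs}]+",  "カ メ"),
]

# Print the Julia version next to each result so outputs from different
# installations can be compared side by side.
for (pattern, text) in cases
    println(VERSION, "  ", pattern, "  =>  ", match_strings(pattern, text))
end
```

A script-plus-space-class pattern that prints two matches instead of one indicates the behaviour described here.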

The problem seems to come from Julia rather than from PCRE2, because using PCRE2 10.35 directly (the version Julia 1.6 seems to use) works fine:

PCRE2 version 10.35 2020-05-09
  re> "[\p{Ll}\p{Zs}]+"
data> "aa bb"
 0: aa bb

PCRE2 version 10.35 2020-05-09
  re> "(*UTF)[\p{Han}\p{Zs}]+"
data> "壹貳 叁"
 0: \x{58f9}\x{8cb3} \x{53c1}
data> "(*UTF)壹貳 叁"
 0: \x{58f9}\x{8cb3} \x{53c1}

I have not been able to identify where this problem comes from (which commit), but I am glad it works on the Julia 1.7 development version (at least for the moment). I could not find out whether the problem was reported and fixed directly in the Julia repository or was solved indirectly.

I don’t know whether we just need to wait for the future 1.7 release to get the fix, or whether it would be useful to open an issue for 1.6.x.

Sincerely.

pixel27 · March 27, 2021, 1:32pm

I think filing a bug report would be the right thing to do, since 1.0.5 behaves correctly:

julia> text = "壹貳 叁"
"壹貳 叁"

julia> pattern = r"[\p{Han}\p{Zs}]+"
r"[\p{Han}\p{Zs}]+"

julia> eachmatch(pattern, text) |> collect
1-element Array{RegexMatch,1}:
 RegexMatch("壹貳 叁")

julia> versioninfo()
Julia Version 1.0.5
Commit 3af96bcefc (2019-09-09 19:06 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, sandybridge)

D_A · March 30, 2021, 7:08pm

I opened an issue.

I could not pin down where it broke, or where on the master branch it started working again; adding a batch of regex tests without a clear target can be quite arduous.
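For anyone who wants to narrow it down, a minimal regression check could look like the sketch below. It uses the standard `Test` library; `splits_on_space` is a hypothetical helper name, and the patterns are the ones from the first post. The test set is expected to pass on unaffected builds and fail on builds that show the 1.6 behaviour, so it could serve as the target when testing individual commits.

```julia
using Test

# Hypothetical helper: true when the pattern does not match the whole text
# as a single run, i.e. when the match is split at the space.
splits_on_space(pattern::Regex, text::AbstractString) =
    length(collect(eachmatch(pattern, text))) != 1

@testset "script property combined with a space class" begin
    # Each of these gave a single match on Julia 1.5 and 1.7 dev,
    # but two matches on Julia 1.6.
    @test !splits_on_space(r"[\p{Han}\p{Zs}]+", "壹貳 叁")
    @test !splits_on_space(r"[\p{Katakana}\s]+", "カ メ")
    @test !splits_on_space(r"[\p{Katakana}\p{Zs}]+", "カ メ")
end
```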
