Did Julia 1.6 introduced a regression for regex properties?

Hello.

It seems the latest Julia release (nice job, by the way!) introduced a strange behaviour for regex properties, especially with scripts…

text = "aa bb"
text |> collect
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

pattern = r"\p{Ll}+"
eachmatch(pattern, text) |> collect
# Good : 2-element Vector{RegexMatch}: RegexMatch("aa") RegexMatch("bb")

pattern = r"[\p{Ll}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good : 1-element Vector{RegexMatch}: RegexMatch("aa bb")

text = "壹貳 叁"
text |> collect
# '壹': Unicode U+58F9 (category Lo: Letter, other)
# '貳': Unicode U+8CB3 (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# '叁': Unicode U+53C1 (category Lo: Letter, other)

pattern = r"[\p{Han}]+"
eachmatch(pattern, text) |> collect
# Good 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")

pattern = r"[\p{Han}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")

pattern = r"[\p{Han} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")

text = "カ メ"
text |> collect
# 'カ': Unicode U+30AB (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'メ': Unicode U+30E1 (category Lo: Letter, other)

pattern = r"[\p{L}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")

pattern = r"[\p{Katakana} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")

pattern = r"[\p{Katakana}\s]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")

pattern = r"[\p{Katakana}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")

The Letter property works fine, but the script properties (like Han, Katakana, etc.) have a hard time mixed with Spaces property, contrary to the simple space character…

It seems it comes from Julia, not from PCRE2, because using directly PCRE2 10.35 (the same version Julia 1.6 seems to use) works fine:

PCRE2 version 10.35 2020-05-09
  re> "[\p{Ll}\p{Zs}]+"
data> "aa bb"
 0: aa bb

PCRE2 version 10.35 2020-05-09
  re> "(*UTF)[\p{Han}\p{Zs}]+"
data> "壹貳 叁"
 0: \x{58f9}\x{8cb3} \x{53c1}
data> "(*UTF)壹貳 叁"
 0: \x{58f9}\x{8cb3} \x{53c1}

I struggle to understand where this problem comes from (which commit), but I am glad it works well on Julia 1.7 (at least for the moment). I did not find if this problem was found and fixed directly on the Julia repository (or if it was solved indirectly…).

I don’t know if we just need to wait for future 1.7 to have this problem fixed, or if it is useful to open a ticket (if necessary) for 1.6.x.

Sincerely.

2 Likes

I think opening a bug would be the correct thing to do as 1.0.5 behaves correctly:

julia> text = "壹貳 叁"
"壹貳 叁"

julia> pattern = r"[\p{Han}\p{Zs}]+"
r"[\p{Han}\p{Zs}]+"

julia> eachmatch(pattern, text) |> collect
1-element Array{RegexMatch,1}:
 RegexMatch("壹貳 叁")

julia> versioninfo()
Julia Version 1.0.5
Commit 3af96bcefc (2019-09-09 19:06 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, sandybridge)
2 Likes

I opened an issue.

I really did not find where it started to be broken and where in master branch it started to work again, because adding a bunch of tests for regex without a real target can be quite arduous.

1 Like