Hello.
It seems the latest Julia release (nice job, by the way!) introduced some strange behaviour for Unicode properties in regexes, especially script properties:
text = "aa bb"
text |> collect
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
# 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
pattern = r"\p{Ll}+"
eachmatch(pattern, text) |> collect
# Good: 2-element Vector{RegexMatch}: RegexMatch("aa") RegexMatch("bb")
pattern = r"[\p{Ll}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("aa bb")
text = "壹貳 叁"
text |> collect
# '壹': Unicode U+58F9 (category Lo: Letter, other)
# '貳': Unicode U+8CB3 (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# '叁': Unicode U+53C1 (category Lo: Letter, other)
pattern = r"[\p{Han}]+"
eachmatch(pattern, text) |> collect
# Good: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")
pattern = r"[\p{Han}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("壹貳") RegexMatch("叁")
pattern = r"[\p{Han} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("壹貳 叁")
text = "カ メ"
text |> collect
# 'カ': Unicode U+30AB (category Lo: Letter, other)
# ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
# 'メ': Unicode U+30E1 (category Lo: Letter, other)
pattern = r"[\p{L}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
pattern = r"[\p{Katakana} ]+"
eachmatch(pattern, text) |> collect
# Good: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
pattern = r"[\p{Katakana}\s]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")
pattern = r"[\p{Katakana}\p{Zs}]+"
eachmatch(pattern, text) |> collect
# Good on Julia 1.5 and 1.7 dev: 1-element Vector{RegexMatch}: RegexMatch("カ メ")
# Bad on Julia 1.6: 2-element Vector{RegexMatch}: RegexMatch("カ") RegexMatch("メ")
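The letter property (\p{L}) works fine, but on Julia 1.6 the script properties (Han, Katakana, etc.) stop matching across the space when they share a character class with the space separator property \p{Zs} (or with \s). Combining them with a literal space character works as expected.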
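For anyone who wants to check their own Julia version, the cases above can be condensed into a small script. This is only a sketch: `match_strings` is a helper name I made up for illustration (it is not part of Julia), and the "ok"/"split" labels just say whether a single match covered the whole text.

```julia
# Sketch of a reproduction script for the cases listed above.
# `match_strings` is a hypothetical helper, not a Julia function.
match_strings(pattern, text) = [String(m.match) for m in eachmatch(pattern, text)]

# (text, pattern) pairs taken from the examples in this post.
cases = [
    ("壹貳 叁", r"[\p{Han}\p{Zs}]+"),
    ("壹貳 叁", r"[\p{Han} ]+"),
    ("カ メ", r"[\p{Katakana}\s]+"),
    ("カ メ", r"[\p{Katakana}\p{Zs}]+"),
    ("カ メ", r"[\p{Katakana} ]+"),
]

println("Julia ", VERSION)
for (text, pattern) in cases
    found = match_strings(pattern, text)
    # A single match equal to the whole text is the expected result.
    status = found == [text] ? "ok" : "split"
    println(rpad(repr(pattern), 28), " => ", found, " (", status, ")")
end
```

Based on the results above, the patterns with a literal space should report "ok" on every version I tried, while the `\p{Zs}` and `\s` variants are the ones that split on Julia 1.6.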
The problem seems to come from Julia rather than from PCRE2, because using PCRE2 10.35 directly (the same version Julia 1.6 seems to use) works fine:
PCRE2 version 10.35 2020-05-09
re> "[\p{Ll}\p{Zs}]+"
data> "aa bb"
0: aa bb
PCRE2 version 10.35 2020-05-09
re> "(*UTF)[\p{Han}\p{Zs}]+"
data> "壹貳 叁"
0: \x{58f9}\x{8cb3} \x{53c1}
data> "(*UTF)壹貳 叁"
0: \x{58f9}\x{8cb3} \x{53c1}
I am struggling to work out where this problem comes from (which commit), but I am glad it works well on Julia 1.7 (at least for the moment). I could not find out whether this problem was reported and fixed directly in the Julia repository, or whether it was fixed indirectly.
I don’t know whether we just need to wait for the 1.7 release to get this fixed, or whether it would be useful to open an issue for 1.6.x.
Sincerely.