We can make alternation work in practice by specializing the +
operator:
Boring code
using Base: PCRE, wrap_string
# copypasta from regex * method, regex.jl line 804
function Base.:+(r1::Union{Regex,AbstractString,AbstractChar}, rs::Union{Regex,AbstractString,AbstractChar}...)
mask = PCRE.CASELESS | PCRE.MULTILINE | PCRE.DOTALL | PCRE.EXTENDED # imsx
match_opts = nothing # all args must agree on this
compile_opts = nothing # all args must agree on this
shared = mask
for r in (r1, rs...)
r isa Regex || continue
if match_opts === nothing
match_opts = r.match_options
compile_opts = r.compile_options & ~mask
else
r.match_options == match_opts &&
r.compile_options & ~mask == compile_opts ||
throw(ArgumentError("cannot multiply regexes: incompatible options"))
end
shared &= r.compile_options
end
unshared = mask & ~shared
Regex(join((wrap_string(r1, unshared), wrap_string.(rs, Ref(unshared))...), "|"), compile_opts | shared, match_opts)
end
Base.:+(r::Regex) = r
julia> x, y = r"abc", r"def"
(r"abc", r"def")
julia> match(x+y, "abc"), match(x+y, "def")
(RegexMatch("abc"), RegexMatch("def"))
julia> x*(x+y) # r"abc(?:abc|def)", equivalent to r"abcabc|abcdef"
r"(?:abc)(?:(?:abc)|(?:def))"
Could be nice to get it into Base
. As you can see, concatenation is distributive over alternation.
The unfortunate thing about these implementations for concatenation and alternation is, it’s not very efficient: it’s basically just converting the regex into a string, doing string manipulation, and then recompiling a new regex from the string. Furthermore this can’t be folded by the compiler. Probably in theory it could be much more efficient, but that might require deeper hooks into the PCRE library than we have (I have no clue). Maybe somebody will one day write a pure Julia PCRE library, in which case the sky would be the limit.
Also, I’m not the happiest about this:
julia> x+y, y+x
(r"(?:abc)|(?:def)", r"(?:def)|(?:abc)")
julia> ans[1] == ans[2] # lies!
false
the two regular expressions above are equivalent, in that they describe the same regular language and the same set of finite automata, but the equality tests for the PCRE library’s regular expressions are not very sophisticated.
Furthermore, we don’t currently have a way to express the Kleene star. We could possibly overload our only unary postfix operator Base.adjoint
, but that doesn’t feel right. Perhaps one day the parser will be modified so that (x*)
will call Base.star(x)
—then Kleene algebra will be fully supported IIUC. (oh how I wish our keyboards had ⋅
though, to disambiguate multiplication from convolution and Kleene stars.)
It’s definitely unintuitive, but to a compsci nerd it should be straightforward because it aligns with first-semester CS computability theory stuff.
I think the best way to grok it, is to think of the operators’ use in finite automata and regular expressions. This is a really good tutorial. And then, instead of thinking of *
as numerical multiplication, think of it more like logical multiplication “and,” &
or ∧
, and instead of thinking of +
as numerical addition, think of it like logical addition “or,” |
or ∨
.
In other words, a finite state machine which accepts the regular language described by the regular expression r"abc" * r"def"
will only accept strings that have "abc"
and "def"
(in that order). And (if we have alternation implemented like I’ve done above) the FSM described by the regular expression r"abc" + r"def"
will accept strings that have "abc"
or "def"
.
Then take the same operator rules for regular expressions and apply them to the strings they process. Alternation doesn’t have an analogous meaning for strings as for regular expressions, but concatenation does.
It’s arguable we should’ve used &
for concatenation and |
for alternation, but that ship has probably sailed. And Stephen Kleene isn’t around, so we can’t prod him to request the Kleene algebra to use these operators instead.
At the end of the day though, just like anything else, you get used to it.