Match a string literal via regex

In Python, '"([^\\"]+|\\.)*?"', or in C#/F#, "\G\"([^\\\"]+|\\\.)*?\"", builds a regular expression that matches a whole string literal such as

"xxxxxxxxxx\"xxxxxxxxxxxxxx\""

However, the same approach does not seem to work in Julia.

regex_s = raw"\"([^\\\"]+|\\\.)*?\""
regex = Regex(raw"\G" * regex_s)
str = repr("xxxxxxxxxx\"xxxxxxxxxxxxxx\"")
match(regex, str)
# RegexMatch("\"xxxxxxxxxx\\\"", 1="xxxxxxxxxx\\")

Any workaround?

Not sure if this is what you want?

julia> str = repr("xxxxxxxxxx\"xxxxxxxxxxxxxx\"")
"\"xxxxxxxxxx\\\"xxxxxxxxxxxxxx\\\"\""

julia> collect(m.match for m in eachmatch(r"[^\"]+", str))
2-element Array{SubString{String},1}:
 "xxxxxxxxxx\\"    
 "xxxxxxxxxxxxxx\\"

I want to check whether a String/SubString starts with a string literal ("...\"..."), and if so, extract it from the head of the String/SubString (which leaves me with a new SubString).

m = match(str_regex, str)
if m === nothing
   @fail # fail and jump out of here
else
   token = m.match
   push!(tokens, token)
   str = SubString(str, ncodeunits(token) + 1)  # continue right after the token
end

Also, a concrete use case is a parser generator with automatic lexers.

I don’t really understand your question in the OP (I don’t know Python/C# regexes well, and you have some typos), but if I understand the title correctly, you can quote a string by enclosing it with \Q and \E, e.g. Regex(raw"\Q" * str * raw"\E").
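
As a small illustration of the \Q...\E suggestion, here is a minimal sketch that assumes the goal is to match a fixed piece of text verbatim. The names `needle` and `quoted` are invented for this example, and the approach would break if the text itself contained \E.

```julia
# \Q ... \E makes PCRE treat everything in between as literal characters.
needle = "a.b*"
quoted = Regex(raw"\Q" * needle * raw"\E")

println(occursin(quoted, "xx a.b* yy"))   # true: the literal text is present
println(occursin(quoted, "xx aXbbb yy"))  # false: . and * are not metacharacters here
```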

Sorry, I’m not trying to match a specific string.
When writing string literals in our code, we first write a ", followed by a sequence of characters, and finally another " to end the literal.

str = "<a sequence>"

Furthermore, when we want to represent a string that contains "s, we have to escape them like this: "xx\"xx".

Note that our source code is not special; it is still plain text.
So how do programming language compilers parse string literals?
One way is to use regular expressions, which are quite mature and capable of matching escapes and quotation marks.
My problem is that this did not work in Julia.

Are you looking for this?

julia> str_regex = r"(...\"...)(.*)"
r"(...\"...)(.*)"

julia> test = "...\"... abc"
"...\"... abc"

julia> m = match(str_regex, test)
RegexMatch("...\"... abc", 1="...\"...", 2=" abc")

julia> m.captures[2]
" abc"

Thank you, but not this.
Given a text file, whose content looks like

"this is a str\"ing"

After reading it into Julia, how can we match it?

If I understand you correctly, you want to match a “string literal”, not match a “literal string”. (You could edit your post’s title.)

Languages are subtly different on what escaping rules apply; see e.g. https://github.com/JuliaLang/julia/issues/22926 for a nice discussion about a (fixed) corner case in Julia. Do you have any specific language you want to emulate?

This one might be helpful to what you’re trying to achieve: http://wordaligned.org/articles/string-literals-and-regular-expressions . After all, your question is about regular expressions much more than it is about Julia.

your question is about regular expressions much more than it is about Julia.

I somewhat disagree. I know how to write this regex, but I don’t know why it fails in Julia.

You are right. I think what’s tripping you up is most likely what I linked above: the behaviour of quotes following backslashes inside raw strings. Specifically, you have \\" inside your character class, and that’s interpreted differently in Julia than elsewhere.
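
To see this concretely, it may help to print what the pattern string actually contains once Julia has processed the raw literal. The sketch below relies on the documented raw"..." rule that backslashes are only special directly before a quote: 2n+1 backslashes followed by " yield n backslashes and a literal quote. The names `original`, `fixed`, and `text` are made up for illustration, and `fixed` is just one possible spelling.

```julia
# Pattern from the question, as Julia actually stores it: the character
# class only excludes quotes, and \\\. means "a backslash, then a literal dot".
original = raw"\"([^\\\"]+|\\\.)*?\""
println(original)   # "([^\"]+|\\\.)*?"

# Five backslashes before the quote leave two backslashes plus a quote in
# the class; \\. is not followed by a quote, so it stays as written.
fixed = raw"\"([^\\\\\"]+|\\.)*?\""
println(fixed)      # "([^\\"]+|\\.)*?"

text = repr("xxxxxxxxxx\"xxxxxxxxxxxxxx\"")
m_original = match(Regex(original), text)
m_fixed = match(Regex(fixed), text)

println(m_original.match)        # stops at the first escaped quote
println(m_fixed.match == text)   # true: the whole literal is matched
```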

Here’s the easiest way I found to get it to work. I tried to make it extra readable by using the x modifier, which allows whitespace and comments as follows:

julia> str = repr("xxxxxxxxxx\"xxxxxxxxxxxxxx\"");

julia> regex = r"""
       \G     # match start
       \"     # opening quote
       (?:    # don't capture (better performance)
           [^\"\\]+  # not a quote or a backslash
           |         # or
           \\.       # an escaped character
       )*?    # ungreedy multiples of the above
       \"     # closing quote
       """x;

julia> match(regex, str)
RegexMatch("\"xxxxxxxxxx\\\"xxxxxxxxxxxxxx\\\"\"")
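
Tying this back to the tokenizer loop earlier in the thread, here is a minimal sketch of how a leading string literal might be split off the input. The helper name `take_string_literal` and the variable `string_literal_re` are hypothetical, and the pattern is simply a one-line spelling of the regex above.

```julia
# \G anchors the match at the position where the search starts, so a
# literal that appears later in the input is not picked up.
string_literal_re = r"\G\"(?:[^\"\\]+|\\.)*?\""

# Hypothetical helper: returns (token, rest) if `s` starts with a string
# literal, or `nothing` otherwise.
function take_string_literal(s::Union{String,SubString{String}})
    m = match(string_literal_re, s)
    m === nothing && return nothing
    token = m.match
    # ncodeunits gives the byte length, so this also works for non-ASCII text.
    rest = SubString(s, ncodeunits(token) + 1)
    return token, rest
end

src = repr("str\"ing") * " + 1"
result = take_string_literal(src)
if result !== nothing
    token, rest = result
    println(token)  # "str\"ing"
    println(rest)   #  + 1
end
```

Returning `nothing` on failure mirrors what `match` itself does, so the caller can decide whether to try another token type or report an error.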

Hope that helps!

Awesome! Also thanks for solving this in such an elegant way!