Non greedy regex match

How can I match a regex non-greedily in Julia?

A general Stack Overflow post mentions .*?, but it doesn’t seem to work.

The Julia docs point to the PCRE2 site, which mentions (?U) for making quantifiers ungreedy (lazy) by default, but I am not sure how to use it.

Example

julia> str = """<a href="files/Zamren.gml">GML</a> <a href="files/Zamren.graphml">GraphML</a>""";

julia> reg = r"<a href=\"files/(.*?)\.graphml\">";

julia> match(reg, str)
RegexMatch("<a href=\"files/Zamren.gml\">GML</a> <a href=\"files/Zamren.graphml\">", 1="Zamren.gml\">GML</a> <a href=\"files/Zamren")

The matching result should be RegexMatch("<a href=\"files/Zamren.graphml\">", 1="Zamren")

PS

Please don’t suggest custom workarounds like:

julia> match(r"<a href=\"files/([^>]*)\.graphml\">", str)
RegexMatch("<a href=\"files/Zamren.graphml\">", 1="Zamren")

You shouldn’t be using regex to parse HTML.


True, but the example should still work, no?

Edit: actually, no, the behavior is perfectly sensible. The .*? pattern is non-greedy, but the regex engine has no way of knowing that you don’t want the match to start as early as possible. Having .*? somewhere in the regex doesn’t mean “make the entire match as short as possible”.

The regex engine scans from the left and finds the first <a href="files/. From there, it adds the minimum number of characters (since .*? is indeed non-greedy) needed to complete the match, which gives a total match of "<a href=\"files/Zamren.gml\">GML</a> <a href=\"files/Zamren.graphml\">". So the syntax for non-greedy matching is indeed *?, but in this case “non-greedy” doesn’t do what you expected, namely find a shorter match that starts later in the string. It's an easy mistake to make (it took me a while, too), and part of why regexes are often quite tricky.
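To make the "leftmost start wins" point concrete, here is a small sketch. The toy string `s` and the names `lazy`, `m`, and `shifted` are my own and not from the thread; the last pattern, with a greedy `.*` prefix, is only an illustration of how the starting position drives the result, not a recommendation.

```julia
# Toy input (hypothetical): there are two possible starting points for "a".
s = "aabc"

# The lazy quantifier extends as little as possible, but only from the
# leftmost position where a match can begin.
lazy = match(r"a.*?c", s)
println(lazy.match)            # "aabc", not "abc"

# The string and pattern from the question.
str = """<a href="files/Zamren.gml">GML</a> <a href="files/Zamren.graphml">GraphML</a>"""
reg = r"<a href=\"files/(.*?)\.graphml\">"

m = match(reg, str)
println(m.offset)              # 1: the match begins at the first <a href="files/

# Illustration only: a greedy prefix consumes as much as it can, so the
# literal <a href="files/ is found at its last possible position and the
# lazy group then captures just the file stem.
shifted = match(r".*<a href=\"files/(.*?)\.graphml\">", str)
println(shifted.captures[1])   # "Zamren"
```

Note that `shifted.match` still contains everything before the second link; only the capture group is short.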

I would strongly second using a proper HTML parser instead of regexes.


Ah, you are right. *? does indeed work as expected in more appropriate situations:

julia> match(r"<a.*>", str)
RegexMatch("<a href=\"files/Zamren.gml\">GML</a> <a href=\"files/Zamren.graphml\">GraphML</a>")

julia> match(r"<a.*?>", str)
RegexMatch("<a href=\"files/Zamren.gml\">")
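As for the (?U) option mentioned in the question, a minimal sketch, assuming Julia passes inline PCRE2 options through unchanged: placing (?U) at the start of the pattern inverts the default, so a plain `*` becomes lazy. The variable names are mine.

```julia
str = """<a href="files/Zamren.gml">GML</a> <a href="files/Zamren.graphml">GraphML</a>"""

greedy   = match(r"<a.*>", str)       # default: `*` is greedy
lazy     = match(r"<a.*?>", str)      # `*?` is lazy
ungreedy = match(r"(?U)<a.*>", str)   # (?U): `*` is lazy by default

# Under this assumption, the (?U) pattern matches the same text as `*?`.
println(ungreedy.match == lazy.match)
println(ungreedy.match == greedy.match)
```

Like *?, (?U) only changes how far each quantifier extends; it does not move where the match starts.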

Thanks!
