Regex replace by group index

When doing a string replace with a regex, I want to know which capture group matched, so that I can perform a replacement specific to that group.

For example, if the match is at a word boundary, the letter should be replaced with X, and if the letter is followed by c, it should be replaced with _:

# "abac" => "Xb_c"
tr = ["X", "_"]
f(s) = tr[1] # ? which index to use: f only receives the matched substring
replace("abac", r"\b(\w)|(\w)(?=c)" => f)

Here f would be given the string "a" in both matches, and I have no indication of which group matched. Splitting replace into multiple runs is not possible, since the context for each match will change.

Currently I am working around this by providing a custom method:

Base._replace(io::IO, repl_s::_MyT, str, r, re::Base.RegexAndMatchData) = begin
    n = Base.PCRE.substring_length_bynumber(re.match_data, 1)
    ...

but I don’t feel at ease patching internal methods.

Is there a better way? For reference, other languages would provide a full Match object to f in this case.

Welcome to Julia! I’m not the best person to answer, and I don’t think I have your answer here, but I’ll do what I can to start.

It doesn’t seem to be quite what you’re after, but the SubstitutionString possibility in replace looks like it could maybe be relevant.
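For reference, here is a small sketch of what a SubstitutionString does. The bracket template is only an illustration I made up: the s"..." literal can refer back to capture groups by number, but it cannot branch on which group matched, so it does not solve the problem by itself.

```julia
# \1 in the s"..." template refers to capture group 1 of the match.
wrapped = replace("abac", r"(\w)(?=c)" => s"[\1]")
println(wrapped)  # prints ab[a]c
```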

I don’t know if it’s applicable to your actual problem, but in this simple example you can just apply multiple replacement patterns:

julia> replace("abac", r"\b(\w)" => "X", r"(\w)(?=c)" => "_")
"Xb_c"
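As far as I understand the documentation, the multi-pattern form (available since Julia 1.7) matches every pattern against the original input in a single pass, so it avoids the "context changes between runs" problem that chained calls have. A small sketch with a made-up input where the two approaches differ:

```julia
str = "azc"

# Chained calls: the second call sees the output of the first,
# so the "c" inserted by the first call changes what the lookahead finds.
chained = replace(replace(str, r"z" => "c"), r"a(?=c)" => "_")

# One call with several patterns: each pattern is matched against the
# original input only, never against inserted replacement text.
single_pass = replace(str, r"z" => "c", r"a(?=c)" => "_")

println(chained)      # prints _cc
println(single_pass)  # prints acc
```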

There might also be something you could do with eachmatch, but it might be some work to plumb that into a replace-like function to actually make the substitution based on what it matched.

# The RegexMatches contain more info than they show here, although some is likely internal.
# Try calling `dump` on one.
julia> eachmatch(r"\b(\w)|(\w)(?=c)", "abac") |> collect
2-element Vector{RegexMatch}:
 RegexMatch("a", 1="a", 2=nothing)
 RegexMatch("a", 1=nothing, 2="a")

There might be room to add a feature to replace where you could make the function be passed the entire RegexMatch object rather than only the SubString. That seems like it is maybe what it should have been to begin with.
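Until something like that exists, here is a minimal sketch of a replace-like function built on eachmatch that hands the whole RegexMatch to the callback. The names replace_with_match and group_index are hypothetical helpers, not part of Base, and this only handles a single regex on a String.

```julia
# Hypothetical helper (not in Base): like `replace`, but `f` receives
# the full RegexMatch instead of only the matched SubString.
function replace_with_match(f, str::AbstractString, re::Regex)
    out = IOBuffer()
    pos = firstindex(str)
    for m in eachmatch(re, str)
        # Copy the unmatched text between the previous match and this one.
        print(out, SubString(str, pos, prevind(str, m.offset)))
        # Let the callback decide the replacement from the full match.
        print(out, f(m))
        # Continue right after the matched bytes.
        pos = m.offset + ncodeunits(m.match)
    end
    # Copy whatever follows the last match.
    print(out, SubString(str, pos))
    return String(take!(out))
end

# Index of the first capture group that took part in the match.
group_index(m::RegexMatch) = findfirst(!isnothing, m.captures)

tr = ["X", "_"]
result = replace_with_match(m -> tr[group_index(m)], "abac", r"\b(\w)|(\w)(?=c)")
println(result)  # prints Xb_c
```

Unlike the original f, the callback here can tell the two "a" matches apart because m.captures shows which alternative matched.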

I also suspect there is a way to use replace, but alternatively, you could use findall. It seems a bit trickier than I had originally imagined though:

julia> function replace_regex(str, regex_replacements)
           # e.g. for str = "abcc", regex_replacements = [r"(\w)(?=c)" => "_", r"\b(\w)" => "X"]
           ranges_replacements = map(regex_replacements) do (rgx, repl)
               return findall(rgx, str) .=> repl
           end  # Vector{Vector{Pair}}, e.g. [[2:2 => "_", 3:3 => "_"], [1:1 => "X"]]
           ranges_replacements = sort(reduce(vcat, ranges_replacements))  # Vector{Pair}, e.g. [1:1 => "X", 2:2 => "_", 3:3 => "_"]
           # Note: if the ranges overlap, things will go wrong. 
           #
           new_str = str
           for (range, repl) in Iterators.reverse(ranges_replacements)
               # Iterate from back to front, so that the ranges remain valid.
               new_str = new_str[begin:range.start-1] * repl * new_str[range.stop+1:end]
           end
           return new_str
       end;

julia> replace_regex("abac", [r"\b(\w)" => "X", r"(\w)(?=c)" => "_"])
"Xb_c"

julia> replace_regex("abcc", [r"(\w)(?=c)" => "_", r"\b(\w)" => "X"])
"X__c"

julia> replace_regex("abzc", [r"z" => "c", r"(\w)(?=c)" => "_"])  # Different from replace_regex("abcc", [r"(\w)(?=c)" => "_"]): "a__c"
"ab_c"

To augment the original example, given the same regex r"\b(\w)|(\w)(?=c)", the replacement should be

"abac" => "Xb_c"
"ebec" => "Yb:c"
"oboc" => "Zb.c"

(actually dozens more)

Iterating over eachmatch seems to be equivalent to what I’m doing, provided I change the regex a little to r"\b(\w)|(\w)(?=c)|(.)". I’ll give it a try. Thank you.
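A rough sketch of that plan, under a few assumptions: the function name translate and the tables tuple are made up for illustration, the table contents are taken from the three expected outputs above, and I added the s flag so that the catch-all (.) also matches newlines (otherwise they would be dropped from the output).

```julia
# One lookup table per capture group (hypothetical names and layout):
# tables[1] applies at a word boundary, tables[2] before a "c".
tables = (
    Dict("a" => "X", "e" => "Y", "o" => "Z"),
    Dict("a" => "_", "e" => ":", "o" => "."),
)

function translate(str::AbstractString)
    # The third alternative matches any other character, so the matches
    # cover the whole input and the output can be built by joining them.
    re = r"\b(\w)|(\w)(?=c)|(.)"s
    pieces = map(collect(eachmatch(re, str))) do m
        g = findfirst(!isnothing, m.captures)
        # Group 3 is the catch-all: copy the character unchanged.
        # Otherwise look it up, falling back to the character itself.
        g == 3 ? m.match : get(tables[g], m.match, m.match)
    end
    return join(pieces)
end

println(translate("abac"))  # prints Xb_c
println(translate("ebec"))  # prints Yb:c
println(translate("oboc"))  # prints Zb.c
```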

If all you need is a simple library of replacements, you could consider something like

julia> wordboundarydict = Dict("a" => "X", "e" => "Y", "o" => "Z"); # replacements after word boundary

julia> beforecdict = Dict("a" => "_", "e" => ":", "o" => "."); # replacements before c

julia> substitute(dict) = key -> get(dict, key, key); # look up the key if present, else return the key unchanged

julia> replace("abac", r"\b(\w)" => substitute(wordboundarydict), r"(\w)(?=c)" => substitute(beforecdict))
"Xb_c"

julia> replace("ebec", r"\b(\w)" => substitute(wordboundarydict), r"(\w)(?=c)" => substitute(beforecdict))
"Yb:c"

julia> replace("oboc", r"\b(\w)" => substitute(wordboundarydict), r"(\w)(?=c)" => substitute(beforecdict))
"Zb.c"

julia> replace("xbec", r"\b(\w)" => substitute(wordboundarydict), r"(\w)(?=c)" => substitute(beforecdict)) # no replacement for x
"xb:c"

But otherwise it sounds like you have a plan you can use based on eachmatch. Good luck!

EDIT: I finally took a look into the Base._replace function you were modifying. Comments around that file suggest that packages might extend some of those functions. This suggests they probably won’t change a lot, although I don’t see anything that amounts to a guarantee. All in all, it looks like your approach there (with your custom types to avoid piracy) was mostly okay if that ends up being the nicest way.

Splitting the original regex into parts and using the multiple-patterns interface works fine for me.