Perform multiple replacements on a string in a single pass

I have a couple of scripts which perform multiple replacements on the same string. It seems as though this can be sped up by searching for all potential matches. For example, something like

julia>replace("123",[r"1" => s"a",r"2" => s"b"])
"ab3"

To be clear, that isn’t real output. Or maybe something with an or logic. Note that putting a . after replace outputs an array with each individual substitution. ie

julia> replace.("123",[r"1" => s"a",r"2" => s"b"])
2-element Array{String,1}:
 "a23"
 "1b3"

which is not what I want.

Currently what I’ve been doing is

str = read(file,String)
for sub in [ list of pairs ]
    str = replace(str,sub)
end

which loops over all the substitution pairs that I need performed. But looping seems inefficient when it seems like it should be possible to perform all subs as you go through the file.

4 Likes

Don’t think this exists because it won’t generalize. What happens when two substitutions collide?

1 Like

To do this kind of thing in a single pass, you could do something like:

julia> subs = Dict("1" => "a", "2" => "b");

julia> replace("123", r"1|2" => s -> subs[s])
"ab3"
3 Likes

What I would expect is that the first substitution that matches is used, or an error is throw. Seems intuitive to me.

First or last takes precedence?

That certainly looks a lot nicer, however in my test it’s much slower.

julia> str = read("file",String)
julia> function test(str)
       subbed = str
       for rs in [
       "e" => "1"
       "t" => "2"
       ]
       subbed = replace(subbed,rs)
       end
       end

julia> function test2(str)
       subs = Dict("e" =>"1", "t" => "2")
       subbed = replace(str, r"e|t" => s -> subs[s])
       end
julia> @time test2(str);
  0.004076 seconds (68.29 k allocations: 2.935 MiB)
julia> @time test(str);
  0.002202 seconds (13 allocations: 319.844 KiB)

The test file being used is about 136kB.

What about if you start substituting your own substitutions though?

eg

replace("123",[r"1" => s"a", r"a2" => s"12"])

Yes, regular expressions allocate a bunch of temporary objects on each match to represent matches and substrings (which is why it is showing so many more allocations for test2), but the advantage of a single regex pass should increase if you have more and longer strings.

(Conversely, if you are really only doing single-character replacements, it will probably be more efficient to use a single map call.)

1 Like

One thing I always expect from replacement methods is that it acts on the original string, and not in a partially changed string. The moment a replacement happens the next character the algorithm look at must be the first character after the replacement (i.e., a character from the original string). I think this behavior is common to all replacement methods I have used in my life.

That returns an error.

@stevengj

Some of the regex expressions are convoluted and exceed 80 characters and I’m matching things ranging from single characters to about ten and I’m subbing them with strings ranging from 5 to 20 characters. I’m not sure if that is the sort of range you’re talking about, but I’ll see how it works in my main problem and report back.

Thanks for the clarification.

Also, if I may, where might I go to unravel what your suggestion actually does? I tried @less but that just took me to the definition of replace. This is a bit beyond me and I’d like to use this as a learning experience as well as a practical answer if possible.

I don’t think that is necessarily correct (starting the second replacement from the character following the spot where the first took place). That would imply that

replace("123",[r"2" => s"b", r"1" => s"a"])

would only do the first sub and not the second.

Anyway, perhaps maybe there are ways to generalize this and I just don’t see it.

Which parts don’t you understand?

I’m not understanding what is going on behind the scene, so-to-speak.

Or at least I’m not certain. I think I could translate that to psudo-code, but the internal workings overall confuse me a bit.

It seems like replace starts by searching through the file until it comes upon a match in the regex. Then it performs a look up in the dictionary and uses that before moving on.

My question, I guess is, what’s going on with the anonymous function. How is the function getting it’s input? What is telling the function “we stopped at ‘e’ so look up the appropriate substitution”? Yes, that is my problem. I don’t see how the matched expression is being passed to the function. Or at least that’s one of them.

Julia calls a regular-expression library (PCRE) that returns each time it encounters a match. Given information about the match (its locations in the string), Julia passes a corresponding SubString (a type of string) argument containing the matched substring to your anonymous function, which computes the substitution string. Then Julia calls PCRE to find the next match, etcetera.

No. You understood me wrong.

The original string is scanned from the first character to the last character. The first regex in the array that matches the current character (until the end of the match) is applied. After this the algorithm keep scanning the original string from the next character that was not replaced by this last match, but it continues to check for matchs starting from the first element of the regex array to the last element of the regex array. This is, for any give character the match-checking always happen in order, and if a match is found the match-checking is interrupted, the replacement is made, the next character to look is the first after the replacement, and the first match to check is again the first match of the regex array not the index in which it stopped last time.

Consequently, both replacements would happen in your example. For the first character the first match does not trigger but the second does. For the second character of the original string the first match triggers, because all matches are always checked, and they are checked in the order given.

1 Like

I actually was hoping/searching for such a function numerous times.

The reason is probably at least twofold
A) I do not know much about regex
B) When developing code, I often clean/parse/amend strings in a stepwise manner (ie I may want to see the intermediate results on-screen (for selected elements).

Usually I don’t even have a for loop but simply a few lines with replace calls.
I guess that works just fine, but I also wondered why the function does not allow to submit multiple pairs

1 Like

As of Julia 1.7, multiple replacements are now supported: Base.replace

For example:


julia> replace("abcabc", "a" => "b", "b" => "c", r".+" => "a")
"bca"
5 Likes

Perfect, this is exactly the behaviour that I argued in favor in this thread.

1 Like

For regex noobs, could you explain what this does in the context of the example? Thank you.

Absolutely, can do. First off, the r prefix means that the string literal will be parsed with the @r_str macro which will convert the string into a Regex, so the regular expression itself is just .+. The dot matches any character, and the plus is a repetition operator meaning “repeat one or more times,” so overall this Regex means “match any character one or more times”.

It looks like they picked this example in the docs to show that all of the replacements are happening together. If this example was the same as repeated calls to the replace function, then this final replacement would just replace any string with one or more characters with a single "a", but as you can see, the actual output starts with "bc" instead.

Overall, the replacements look like this:

The first character of the string is compared with the first character of the first pattern, in this case there’s a match, and that happens to be the whole of the first pattern, so "b" is appended to the output:

input:  abcabc
        ^
output: b

Next, the second character is compared against the first character of the first pattern. This isn’t a match, so the second character is compared against the first character of the second pattern. This is a match, so that character is replaced:

input:  abcabc
         ^
output: bc

Finally, the third character doesn’t match the first or second pattern, but the Regex will match any character, so it does so, matching the whole rest of the string:

input:  abcabc
          ^^^^
output: bca

Since the Regex has already matched up until the end of the string, the first and second patterns never get a chance to match the second "a" or "b", so that’s the end. An improvement to this example might be to include some characters in the source string which are not replaced.

2 Likes

@DouglasLivingstone, thanks for the brilliant and clear explanation. Really appreciated.

1 Like