Problem with `replace` function making too many replacements

00krishna · July 13, 2024, 11:52pm

I have a dataset where one of the columns provides the names of university departments, such as “economics department” and “statistics department”, etc. However, there are cases where some rows list the name as “statistics department” while other rows list it as “statistics dept”, etc. Hence I am trying to clean up and standardize the names a bit so that “statistics department” and “statistics dept” are not treated as two separate departments.

I am running into some trouble with my replacements where it seems like multiple replacements get applied to a single input value. This is leading to some weird outputs. I have an MWE below. I did use the count=1 argument in the function, but even then I am hitting this issue. I suppose there is some circularity in the replacements as I try to hit all the variations I can think of. Perhaps someone can think of a better way.

using DataFrames

d = [ "atm, oceanic & space sci.",
 "atm, oceanic and space sciences",
 "biologic and materials science",
 "biologic and materials sciences"]


rpl = [ " sci " => " sciences ", 
        " sci" => " sciences",
        " sci." => " sciences", 
        " science " => " sciences ",
        " science" => " sciences",
        " sciencess" => " sciences"]

map(x -> replace(x, rpl ..., count=1), d)

The output that I get looks like:

4-element Vector{String}:
 "atm, oceanic & space sciences."
 "atm, oceanic and space sciencesences"
 "biologic and materials sciencesence"
 "biologic and materials sciencesences"

Some replacements work correctly, while others end up with these strange overlapping replacements. Can anyone indicate where I am going wrong and how to fix it?

Thanks for the assistance.

savq · July 14, 2024, 3:32am

I’m not entirely sure what the problem is, but you can use a word boundary assertion \b to avoid dealing with spaces, which fixes the problem.

julia> depts = [
           "atm, oceanic & space sci.",
           "atm, oceanic and space sciences",
           "biologic and materials science",
           "biologic and materials sciences"
       ];

julia> replacements = [
           r"\bsci\b\.?" => "sciences",
           r"\bsciences*\b" => "sciences",
       ];

julia> map(str -> replace(str, replacements...), depts)
4-element Vector{String}:
 "atm, oceanic & space sciences"
 "atm, oceanic and space sciences"
 "biologic and materials sciences"
 "biologic and materials sciences"

00krishna · July 14, 2024, 4:18am

@savq that is very interesting, I had not seen that before. I don’t work with regular expressions that often, so I don’t know all the tricks. But I will certainly try this.
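
For readers less used to regular expressions, here is a small sketch of what the `\b` assertion in the patterns above does. It only uses `occursin` and `match` from Base; the sample strings are taken from the thread. `\b` matches the empty position between a word character and a non-word character, so `\bsci\b` finds “sci” only when it stands as a whole word, whereas the plain string " sci" also matches the start of " sciences".

```julia
# `\b` is a zero-width word-boundary assertion (PCRE syntax, as used by Julia's Regex).
whole_word = r"\bsci\b"

hit_abbrev = occursin(whole_word, "space sci.")      # "sci" is followed by ".", a boundary
hit_inside = occursin(whole_word, "space sciences")  # "sci" is followed by "e", no boundary
hit_plain  = occursin(" sci", "space sciences")      # plain substring search still matches

# In the second pattern, `s*` means "zero or more s", so it covers both
# "science" and "sciences" (and a doubled "sciencess").
m_singular = match(r"\bsciences*\b", "materials science").match
m_plural   = match(r"\bsciences*\b", "space sciences").match

println((hit_abbrev, hit_inside, hit_plain))  # (true, false, true)
println((m_singular, m_plural))
```

If only the singular and plural forms are expected, `sciences?` (zero or one “s”) would be a slightly stricter alternative; that variant is a suggestion of this note, not something from the posts above.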

jules · July 15, 2024, 7:00pm

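The problem is that " sci" matches all four entries, and it is listed ahead of the longer patterns that would also match, so it is the one that gets applied. The first entry only looks correct “by chance”: apart from the leftover period, the output happens to be right because the word was already cut off at “sci”.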

You have to order your replacement pairs so that the longer ones come first; that way no pattern is the beginning of a later one.

But the word boundary is a better solution anyway.
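
A minimal sketch of the ordering idea, under stated assumptions: `demo` and `longest_first` are names made up for this illustration, and the identity pair `" sciences" => " sciences"` is an addition that does not appear in the original list. Without it, " science" would match inside " sciences" and leave a doubled “s”. The sketch also relies on the behavior visible in the output above, namely that when several patterns match at the same position the one listed first is applied.

```julia
d = ["atm, oceanic & space sci.",
     "atm, oceanic and space sciences",
     "biologic and materials science",
     "biologic and materials sciences"]

# A single pair is enough to reproduce the odd output: only the four characters
# " sci" are replaced and the rest of the word ("ence") is left behind.
demo = replace("biologic and materials science", " sci" => " sciences")

# Longest pattern first, so that no pattern is a prefix of a later one.
longest_first = [
    " sciences" => " sciences",  # identity pair (assumption of this sketch)
    " science"  => " sciences",
    " sci."     => " sciences",
    " sci"      => " sciences",
]

cleaned = map(x -> replace(x, longest_first...), d)

println(demo)  # biologic and materials sciencesence
foreach(println, cleaned)
```

This still treats " sci" as a plain substring, so a word such as “scientific” would also be rewritten, which is one reason the word-boundary approach is more robust.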

DNF · July 15, 2024, 7:06pm

00krishna:

rpl = [ " sci " => " sciences ", 
        " sci" => " sciences",
        " sci." => " sciences", 
        " science " => " sciences ",
        " science" => " sciences",
        " sciencess" => " sciences"]

Instead of trying to find all versions of all strings combined with different leading and trailing spaces, punctuation, “&”, etc. (leading to a combinatorial explosion), isn’t it better to divide the problem into several steps?

First remove leading, trailing, and multiple spaces. Remove unneeded punctuation, replace “&” with “and”, etc.

Only then turn “sci” into “sciences”, etc.
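
The steps above could be sketched roughly as follows. `normalize_dept` is a hypothetical helper written for this illustration, and the choices it makes (which punctuation counts as unneeded, which abbreviations to expand) are assumptions rather than anything specified in the thread.

```julia
# Hypothetical helper: normalize spacing and punctuation first,
# and only then expand abbreviations.
function normalize_dept(name::AbstractString)
    s = replace(name, "&" => " and ")            # spell out ampersands
    s = replace(s, r"[.,]" => "")                # assumption: periods and commas are unneeded
    s = String(strip(replace(s, r"\s+" => " "))) # collapse runs of spaces, trim both ends
    # Because the string is already clean, whole-word patterns are enough here.
    return replace(s,
        r"\bsci\b"     => "sciences",
        r"\bscience\b" => "sciences",
        r"\bdept\b"    => "department")
end

names = ["atm, oceanic & space sci.",
         "atm, oceanic and space sciences",
         "  statistics   dept. "]

normalized = map(normalize_dept, names)
foreach(println, normalized)
```

With this approach the first two inputs collapse to the same string, “atm oceanic and space sciences”, because the comma, the ampersand, and the abbreviation are each handled by a separate, simple step.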
