I have a dataset where one of the columns provides the names of university departments, such as “economics department” and “statistics department”, etc. However there are cases where some rows list the name as “statistics department” while other row list the name as “statistics dept”, etc. Hence I am trying to clean up and standardize the names a bit so that “statistics department” and “statistics dept” are not treated as 2 separate departments.
I am running into some trouble with my replacements where it seems like multiple replacements get applied to a single input value. This is leading to some weird outputs. I have an MWE below. I did use the count=1 argument in the function, but even then I am hitting this issue. I suppose there is some circularity in the replacements as I try to hit all the variations I can think of. Perhaps someone can think of a better way.
using DataFrames
d = [ "atm, oceanic & space sci.",
"atm, oceanic and space sciences",
"biologic and materials science",
"biologic and materials sciences"]
rpl = [ " sci " => " sciences ",
" sci" => " sciences",
" sci." => " sciences",
" science " => " sciences ",
" science" => " sciences",
" sciencess" => " sciences"]
map(x -> replace(x, rpl ..., count=1), d)
The output that I get looks like:
4-element Vector{String}:
"atm, oceanic & space sciences."
"atm, oceanic and space sciencesences"
"biologic and materials sciencesence"
"biologic and materials sciencesences"
Some replacements work correctly, while others end up with these strange overlapping replacements. Can anyone indicate where I am going wrong and how to fix it?
@savq that is very interesting, I had not seen that before. I don’t work with regular expressions that often, so I don’t know all the tricks. But I will certainly try this.
The problem is that " sci" matches all four entries, the first is only correct “by chance” because even ignoring the period, the output is right when the word is cut off anyway.
You have to order your replacement pairs such that the long ones come first, that way no match is the beginning of a later one.
But the word boundary is a better solution anyway.
Instead of trying to find all versions of all strings combined with different leading and trailing spaces and punctuation, &, etc (leading to a combinatorial explosion), isn’t it better to divide the problem into several steps?
First remove leading, trailing, and multiple spaces. Remove unneeded punctuation, replace “&” with “and”, etc.