Similar strings

When importing data, I get many “similar” labels, such as
“T-012”, “T-12”, “T12” and “T012”.
They all refer to the same item, so I must choose one and discard the others.
Could Levenshtein distance resolve this?
And why was Levenshtein in Julia abandoned?
thanks!

I usually rely on regex to remove the redundancy. Julia has built-in regex support, if you know how to use it:

match(r"T[-0]*\d+", label)
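Note that match only tells you whether a label fits the pattern (it returns a RegexMatch or nothing). To actually collapse the variants into one form, a capture-based replace is one option. This is a minimal sketch: the name normalize_label and the assumed label shape (letters, an optional hyphen, optional leading zeros, then digits) are illustrative assumptions, not something from the original data.

```julia
# Hypothetical helper: rewrite "T-012", "T-12", "T012" as "T12".
# Assumes labels look like: letters, optional '-', optional leading zeros, digits.
# Labels that do not fit the pattern are returned unchanged.
normalize_label(label) = replace(label, r"^([A-Za-z]+)-?0*(\d+)$" => s"\1\2")

labels = ["T-012", "T-12", "T12", "T012"]
normalize_label.(labels)   # every entry becomes "T12"
```

Because \d+ needs at least one digit, a label such as "S-0" keeps its zero and becomes "S0".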

If you are not familiar with regex syntax, you can take a look at:

https://github.com/jkrumbiegel/ReadableRegex.jl

Finally, there is

https://github.com/matthieugomez/StringDistances.jl

for distances between strings.

Unlikely. You’ll get the same distance between labels you think are similar and labels you think are distinct. For example, (T-12, T12) and (T-12, T212) both have distance 1. It’s better to rely on patterns you know occur in similar labels, as @juliohm suggests.
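To check this without installing anything, here is a small hand-written Levenshtein function (the standard dynamic-programming approach). It is only a sketch for illustration, and the function name levenshtein is my own; it is not the API of any package.

```julia
# Minimal Levenshtein distance via dynamic programming, keeping one row at a time.
function levenshtein(a::AbstractString, b::AbstractString)
    x, y = collect(a), collect(b)
    prev = collect(0:length(y))        # distances from the empty prefix of a
    for i in eachindex(x)
        curr = similar(prev)
        curr[1] = i                    # delete the first i characters of a
        for j in eachindex(y)
            cost = x[i] == y[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

levenshtein("T-12", "T12")    # 1: delete '-'
levenshtein("T-12", "T212")   # 1: replace '-' with '2'
```

Both pairs come out at distance 1, so a distance threshold alone cannot tell the duplicate from the genuinely different label.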

thanks!

Yes, the distance is the same in both cases: one character to change.
thanks!

FWIW, an alternative to regex:

uniquetag(s) = s[1:findfirst(!isletter,s)-1] * string(parse(Int,s[findfirst(isdigit,s):end]))

tags = ["T-012","T-12","T12","T012","S-0","S0","S01","S-1"]
uniquetag.(tags)

 "T12"
 "T12"
 "T12"
 "T12"
 "S0"
 "S0"
 "S1"
 "S1"
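
For the "choose one and discard the others" part, the canonical tags can be deduplicated or used as grouping keys. This is a rough sketch reusing uniquetag from above; the groups dictionary is just an illustrative name.

```julia
# Same helper as above: letters before the first non-letter, plus the parsed integer.
uniquetag(s) = s[1:findfirst(!isletter,s)-1] * string(parse(Int,s[findfirst(isdigit,s):end]))

tags = ["T-012","T-12","T12","T012","S-0","S0","S01","S-1"]

# Keep one canonical label per group, in order of first appearance.
unique(uniquetag.(tags))   # ["T12", "S0", "S1"]

# Or remember which original spellings map to each canonical label.
groups = Dict{String,Vector{String}}()
for t in tags
    push!(get!(groups, uniquetag(t), String[]), t)
end
groups["S1"]   # ["S01", "S-1"]
```

Note that uniquetag assumes every label has at least one non-letter character and at least one digit; otherwise findfirst returns nothing and the call errors.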