When importing data, there are many “similar” labels, like:
“T-012”, “T-12”, “T12” and “T012”.
They all refer to the same item, so I must choose one and discard the others.
Could Levenshtein distance resolve this?
And why was Levenshtein in Julia abandoned?
thanks!
I usually rely on regex to remove the redundancy. Julia has built-in regex support in case you know how to use it:
match(r"T[-0]*\d+", label)
If you are not familiar with regex syntax, you can take a look at:
https://github.com/jkrumbiegel/ReadableRegex.jl
Finally, there is
https://github.com/matthieugomez/StringDistances.jl
for distances between strings.
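To make the regex idea concrete, here is a minimal sketch that rewrites each label to one canonical form. The helper name `normalize_label` is hypothetical, and the pattern assumes every label is a run of letters, an optional hyphen, and then digits with optional leading zeros; labels that do not fit are returned unchanged.

```julia
# Hypothetical helper: assumes labels look like letters, optional "-", then digits.
function normalize_label(label::AbstractString)
    # Capture the letter prefix and the digits after any leading zeros.
    m = match(r"^([A-Za-z]+)-?0*(\d+)$", label)
    m === nothing && return label  # leave unrecognized labels untouched
    return m.captures[1] * m.captures[2]
end

labels = ["T-012", "T-12", "T12", "T012"]
unique(normalize_label.(labels))  # all four collapse to "T12"
```

If your real labels follow a different convention, the pattern needs to be adjusted accordingly.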
Unlikely. You’ll get the same distance between labels you think are similar and labels you think are distinct. For example, (T-12, T12) and (T-12, T212) both have a Levenshtein distance of 1. It’s better to rely on patterns you know occur in similar labels, as @juliohm suggests.
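To check this without installing anything, here is a small sketch of the textbook dynamic-programming Levenshtein distance in plain Julia. The function name `levenshtein` is my own and this is not the StringDistances.jl implementation, only an illustration of the point.

```julia
# Illustrative edit distance (insertions, deletions, substitutions all cost 1).
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))      # distances from the empty prefix of `a`
    for i in 1:length(s)
        curr = similar(prev)
        curr[1] = i                  # delete the first i characters of `a`
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,  # deletion
                              curr[j] + 1,      # insertion
                              prev[j] + cost)   # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

levenshtein("T-12", "T12")   # 1 (delete the hyphen)
levenshtein("T-12", "T212")  # 1 (replace the hyphen with '2')
```

Both pairs come out at distance 1, so a threshold on the distance alone cannot tell the duplicate from the genuinely different label.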
thanks!
yes… the same distance in all cases… 1 char to change…
thanks!
FWIW, an alternative to regex:
uniquetag(s) = s[1:findfirst(!isletter,s)-1] * string(parse(Int,s[findfirst(isdigit,s):end]))
tags = ["T-012","T-12","T12","T012","S-0","S0","S01","S-1"]
uniquetag.(tags)
"T12"
"T12"
"T12"
"T12"
"S0"
"S0"
"S1"
"S1"