Similar strings

cesarmarinhorj · January 13, 2022, 7:09pm

When importing data, I get many “similar” labels, such as
“T-012”, “T-12”, “T12” and “T012”.
They all refer to the same thing, so I must choose one and discard the others.
Could Levenshtein distance resolve this?
And why was Levenshtein in Julia abandoned?
thanks!

juliohm · January 13, 2022, 7:14pm

I usually rely on regex to remove the redundancy. Julia has built-in regex support in case you know how to use it:

match(r"T[-0]*\d+", label)
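The `match` call above only detects a label. To collapse the variants into one spelling, a regex substitution is one option. This is a minimal sketch, assuming every label is a run of letters, an optional hyphen, and a run of digits; the name `normalize_label` is hypothetical and not from any package.

```julia
# Hypothetical helper: drop the optional hyphen and any leading zeros.
# Assumes labels look like <letters>[-]<digits>; anything else is returned unchanged.
normalize_label(label) = replace(label, r"^([A-Za-z]+)-?0*(\d+)$" => s"\1\2")

labels = ["T-012", "T-12", "T12", "T012", "S-0", "S01"]
normalized = normalize_label.(labels)
```

Because `\d+` needs at least one digit, a label such as "S-0" keeps its single zero instead of losing the number entirely.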

If you are not familiar with regex syntax, you can take a look at:

https://github.com/jkrumbiegel/ReadableRegex.jl

Finally, there is

https://github.com/matthieugomez/StringDistances.jl

for distances between strings.

Jeff_Emanuel · January 13, 2022, 8:03pm

Unlikely. You’ll get the same distance between labels you think are similar and labels you think are distinct. For example, (T-12, T12) and (T-12, T212) have the same distance. It’s better to rely on patterns you know are in similar labels, as @juliohm suggests.
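To check the example above without installing anything, here is a small self-contained sketch of the textbook dynamic-programming Levenshtein distance. The function name `levenshtein` is just for illustration and is not the StringDistances.jl API.

```julia
# Illustrative implementation: minimum number of single-character
# insertions, deletions, or substitutions needed to turn `a` into `b`.
function levenshtein(a::AbstractString, b::AbstractString)
    x, y = collect(a), collect(b)
    prev = collect(0:length(y))          # distances from the empty prefix of `a`
    for i in eachindex(x)
        curr = similar(prev)
        curr[1] = i                      # delete the first i characters of `a`
        for j in eachindex(y)
            cost = x[i] == y[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

d_same = levenshtein("T-12", "T12")    # labels meant to be the same
d_diff = levenshtein("T-12", "T212")   # labels meant to be distinct
```

Both pairs are one edit apart (one deletion versus one substitution), so a distance threshold alone cannot tell them apart.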

cesarmarinhorj · January 13, 2022, 8:13pm

thanks!

cesarmarinhorj · January 13, 2022, 8:15pm

yes… the same distance in all cases… 1 char to change…
thanks!

rafael.guerra · January 13, 2022, 10:36pm

FWIW, an alternative to regex:

uniquetag(s) = s[1:findfirst(!isletter,s)-1] * string(parse(Int,s[findfirst(isdigit,s):end]))

tags = ["T-012","T-12","T12","T012","S-0","S0","S01","S-1"]
uniquetag.(tags)

 "T12"
 "T12"
 "T12"
 "T12"
 "S0"
 "S0"
 "S1"
 "S1"
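Since the original question was about choosing one label and discarding the others, the canonical tag can also serve as a grouping key. The following is a sketch built on the `uniquetag` function above; the `groups` dictionary is an illustrative name, and the same assumptions about label shape (ASCII letters first, digits last) apply.

```julia
# Same function as above, repeated so the snippet runs on its own.
uniquetag(s) = s[1:findfirst(!isletter, s)-1] * string(parse(Int, s[findfirst(isdigit, s):end]))

tags = ["T-012", "T-12", "T12", "T012", "S-0", "S0", "S01", "S-1"]

# Map each canonical tag to the original spellings that reduce to it.
groups = Dict{String,Vector{String}}()
for t in tags
    push!(get!(groups, uniquetag(t), String[]), t)
end

# The canonical labels to keep, in order of first appearance.
kept = unique(uniquetag.(tags))
```

Keeping the mapping makes it possible to check afterwards which raw labels were merged into each canonical one.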
