Approximate string matching for two lists of names in Julia

I have two lists of names that are approximately similar. So I wanted to figure out the best way to find the matches. For example, some sample data, the lists might look like:

list1 = ["Colby, James"; "Arnoldson, Robert J"; "Linkletter, Mary"]
list2 = ["Colby, J"; "Arnoldson, R J", "Jefferson, Robert", "Linkletter, Mary A"] 

The output I would like to get is a final list of names where the list1 instance of a matched name is included in the output. So the output would be something like:

output_list = ["Colby, James"; "Arnoldson, Robert J"; "Linkletter, Mary", "Jefferson, Robert"]

There is a package entitled StringDistances.jl, which computes the distances between string using the Levenshtein, Jaro-Winkler, Jaccard, etc. distances, so that is a first step.

My question is, what is a good way to go about finding the optimal matches. That is, I could set a threshold on the Levenshtein distance, but that single metric might miss something. Also, it is hard to know what the optimal threshold is that gets the best set of matches with the least erroneous matches. I could also compute a few different distance metrics on each pair of strings and decide on a match depending on a consensus between metrics?

Can anyone suggest a good method to do this kind of matching? Julia seems good for this, since it is pretty fast :).

Thanks.

To help your search, this is called “record linkage”. There have been some packages over the years.

1 Like

I do something similar. I made a “blended” comparison metric

fuzzy_strcmp(a, b) = 1 / 10 * (5 * compare(a, b, Levenshtein()) + 3 * compare(a, b, RatcliffObershelp()) + 2 * compare(a, b, DamerauLevenshtein()))

with absolutely no rigor put into those coefficients – I just picked from the metrics available and made something up

then compute the pairwise distances for all strings in each list and iteratively extract the max similar match

1 Like

Linking related thread

1 Like

I’ve done a fair amount of this kind of thing and have tried several different ways…a few things I’ve learned:

  • Invest in tidying up your strings as it pays off (e.g., remove punctuation, separate first/last names, convert everything to upper/lower case, etc.)
  • Take the time to figure out which string distance performs the best for the kinds of comparisons you are doing (e.g., Jaro-Winkler emphasizes the beginnings of the strings, which may or may not be good for your use case; look into the different modifiers in StringDistances.jl as well, because you can play around with different combinations of those)
  • Levenshtein and Damerau-Levenshtein are true metrics and therefore can be used to build tree structures (VPTrees.jl, BKTrees.jl, etc.). If you have a large number of comparisons, leveraging these structures can really speed things up. You’ll have to figure out the time it takes to build the trees and whether it’s actually faster for your particular use case…sometimes it is, sometimes it isn’t. If list1 is fairly static, for example, and you will be comparing against it frequently, putting it into a tree structure is almost certainly worth it.

Also, check out Phonetics.jl (hasn’t had any commits for quite some time, but it may still be useful).

For your MWE, a possible solution could be something like this:

[findnearest(s, list1, TokenMax(DamerauLevenshtein()); min_score=0.75)[1] for s in list2]

# output:

Union{Nothing, String} [
  "Colby, James",
  "Arnoldson, Robert J",
  nothing,
  "Linkletter, Mary"
]
2 Likes

For this specific example or not very different situations perhaps we can do without bothering specific packages for textual analysis.
It might be enough to do something like this and adapt it to the case with the use of other basic functions.

using IterTools
st=sort(union(list1, list2))
dt=partition(st,2,1)
[(s[1],s[2]) for s in dt if issubset(s[1],s[2])||issubset(s[2],s[1])]

However, I’m curious to know what your actual case is.
Could you provide some richer examples?
PS
why do we choose “Linkletter, Mary” between these two “Linkletter, Mary”, “Linkletter, Mary A” ?

1 Like

I looked for record linkage packages, but I believe those things require a number of fields to measure over right? I did not seem to find something that would match just on names, but I might not have been looking in the right place. But I will keep looking. Some of the Tree stuff seems interesting, as I have not seen that before. So I will keep researching.

Ahh, this is helpful. Yeah, I can see what you mean. I imagine I would need to normalize over the scores too, since they might be different scales. but this gives me a sense of how to do it. Thanks for the info.

This is very helpful, thank you. Yeah, this is good practical info. I have been working on separating last names and first names, and removing extra punctuation and such. I have never used those Tree structures before, so I need to research that some more. That is great info on how to pick the metrics as well. I suppose it will take a few attempts and some tuning to make it work. But I appreciate your giving a direction to point in. Thanks again.

1 Like