Approximate string matching for two lists of names in Julia

00krishna · April 3, 2024, 2:02am

I have two lists of names that are approximately similar. So I wanted to figure out the best way to find the matches. For example, some sample data, the lists might look like:

list1 = ["Colby, James"; "Arnoldson, Robert J"; "Linkletter, Mary"]
list2 = ["Colby, J"; "Arnoldson, R J", "Jefferson, Robert", "Linkletter, Mary A"]

The output I would like to get is a final list of names where the list1 instance of a matched name is included in the output. So the output would be something like:

output_list = ["Colby, James"; "Arnoldson, Robert J"; "Linkletter, Mary", "Jefferson, Robert"]

There is a package entitled StringDistances.jl, which computes the distances between string using the Levenshtein, Jaro-Winkler, Jaccard, etc. distances, so that is a first step.

My question is, what is a good way to go about finding the optimal matches. That is, I could set a threshold on the Levenshtein distance, but that single metric might miss something. Also, it is hard to know what the optimal threshold is that gets the best set of matches with the least erroneous matches. I could also compute a few different distance metrics on each pair of strings and decide on a match depending on a consensus between metrics?

Can anyone suggest a good method to do this kind of matching? Julia seems good for this, since it is pretty fast :).

Thanks.

jar1 · April 3, 2024, 2:20am

To help your search, this is called “record linkage”. There have been some packages over the years.

adienes · April 3, 2024, 3:08am

I do something similar. I made a “blended” comparison metric

fuzzy_strcmp(a, b) = 1 / 10 * (5 * compare(a, b, Levenshtein()) + 3 * compare(a, b, RatcliffObershelp()) + 2 * compare(a, b, DamerauLevenshtein()))

with absolutely no rigor put into those coefficients – I just picked from the metrics available and made something up

then compute the pairwise distances for all strings in each list and iteratively extract the max similar match

rafael.guerra · April 3, 2024, 7:04pm

Linking related thread

mthelm85 · April 3, 2024, 7:22pm

I’ve done a fair amount of this kind of thing and have tried several different ways…a few things I’ve learned:

Invest in tidying up your strings as it pays off (e.g., remove punctuation, separate first/last names, convert everything to upper/lower case, etc.)
Take the time to figure out which string distance performs the best for the kinds of comparisons you are doing (e.g., Jaro-Winkler emphasizes the beginnings of the strings, which may or may not be good for your use case; look into the different modifiers in StringDistances.jl as well, because you can play around with different combinations of those)
Levenshtein and Damerau-Levenshtein are true metrics and therefore can be used to build tree structures (VPTrees.jl, BKTrees.jl, etc.). If you have a large number of comparisons, leveraging these structures can really speed things up. You’ll have to figure out the time it takes to build the trees and whether it’s actually faster for your particular use case…sometimes it is, sometimes it isn’t. If list1 is fairly static, for example, and you will be comparing against it frequently, putting it into a tree structure is almost certainly worth it.

Also, check out Phonetics.jl (hasn’t had any commits for quite some time, but it may still be useful).

For your MWE, a possible solution could be something like this:

[findnearest(s, list1, TokenMax(DamerauLevenshtein()); min_score=0.75)[1] for s in list2]

# output:

Union{Nothing, String} [
  "Colby, James",
  "Arnoldson, Robert J",
  nothing,
  "Linkletter, Mary"
]

rocco_sprmnt21 · April 3, 2024, 9:04pm

For this specific example or not very different situations perhaps we can do without bothering specific packages for textual analysis.
It might be enough to do something like this and adapt it to the case with the use of other basic functions.

using IterTools
st=sort(union(list1, list2))
dt=partition(st,2,1)
[(s[1],s[2]) for s in dt if issubset(s[1],s[2])||issubset(s[2],s[1])]

However, I’m curious to know what your actual case is.
Could you provide some richer examples?
PS
why do we choose “Linkletter, Mary” between these two “Linkletter, Mary”, “Linkletter, Mary A” ?

00krishna · April 3, 2024, 10:22pm

I looked for record linkage packages, but I believe those things require a number of fields to measure over right? I did not seem to find something that would match just on names, but I might not have been looking in the right place. But I will keep looking. Some of the Tree stuff seems interesting, as I have not seen that before. So I will keep researching.

00krishna · April 3, 2024, 10:23pm

Ahh, this is helpful. Yeah, I can see what you mean. I imagine I would need to normalize over the scores too, since they might be different scales. but this gives me a sense of how to do it. Thanks for the info.

00krishna · April 3, 2024, 10:29pm

This is very helpful, thank you. Yeah, this is good practical info. I have been working on separating last names and first names, and removing extra punctuation and such. I have never used those Tree structures before, so I need to research that some more. That is great info on how to pick the metrics as well. I suppose it will take a few attempts and some tuning to make it work. But I appreciate your giving a direction to point in. Thanks again.

Topic		Replies	Views
What's the best way to compare items in an array to every other item? General Usage performance , strings , arrays	11	3569	July 26, 2019
Similar strings New to Julia strings	5	633	January 13, 2022
Why do these two functions benchmark the same? Performance	2	1332	August 3, 2018
Entity resolution/duplicate data in Julia General Usage	10	1640	June 8, 2023
Jaccard distance of each combo of elements in 2 lists General Usage strings	2	412	November 2, 2020

Approximate string matching for two lists of names in Julia

Related topics