I have two lists of names that are approximately similar, and I want to figure out the best way to find the matches between them. For example, the lists might look like:
list1 = ["Colby, James", "Arnoldson, Robert J", "Linkletter, Mary"]
list2 = ["Colby, J", "Arnoldson, R J", "Jefferson, Robert", "Linkletter, Mary A"]
The output I would like is a final list of names in which, for every matched name, the list1 version is the one included, while names without a match are kept as they are. So the output would be something like:
output_list = ["Colby, James", "Arnoldson, Robert J", "Linkletter, Mary", "Jefferson, Robert"]
There is a package called StringDistances.jl, which computes the distance between strings using the Levenshtein, Jaro-Winkler, Jaccard, etc. metrics, so that is a first step.
My question is: what is a good way to go about finding the optimal matches? I could set a threshold on the Levenshtein distance, but a single metric might miss something. It is also hard to know which threshold gives the most correct matches with the fewest erroneous ones. Alternatively, I could compute a few different distance metrics on each pair of strings and decide on a match based on a consensus between the metrics.
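To make the single-threshold idea concrete, here is a rough sketch of what I have in mind. It uses a hand-written Levenshtein distance so that it runs without any packages; in practice the distance would come from StringDistances.jl. The names `levenshtein`, `similarity`, and `merge_names`, and the `threshold = 0.6` default, are just my own placeholders, not anything from the package.

```julia
# Plain dynamic-programming Levenshtein distance (stand-in for a package call).
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))          # distances from the empty prefix of a
    for i in eachindex(s)
        cur = similar(prev)
        cur[1] = i
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1, cur[j] + 1, prev[j] + cost)
        end
        prev = cur
    end
    return prev[end]
end

# Normalize to a similarity in [0, 1], where 1.0 means identical strings.
function similarity(a::AbstractString, b::AbstractString)
    n = max(length(a), length(b))
    return n == 0 ? 1.0 : 1 - levenshtein(a, b) / n
end

# Keep every list1 name; add a list2 name only if nothing in list1
# is similar enough to count as a match. The threshold is an assumption.
function merge_names(list1, list2; threshold = 0.6)
    output = collect(list1)
    for name in list2
        matched = any(ref -> similarity(name, ref) >= threshold, list1)
        matched || push!(output, name)
    end
    return output
end

list1 = ["Colby, James", "Arnoldson, Robert J", "Linkletter, Mary"]
list2 = ["Colby, J", "Arnoldson, R J", "Jefferson, Robert", "Linkletter, Mary A"]
output_list = merge_names(list1, list2)
```

A threshold of 0.6 happens to separate the matches from the non-matches in this sample data, but only narrowly, which is exactly the problem: I do not know how to pick that number for the real lists.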
Can anyone suggest a good method to do this kind of matching? Julia seems good for this, since it is pretty fast :).
Thanks.