Entity resolution/duplicate data in Julia

I think it would be a good idea to start brainstorming a framework for doing entity resolution across datasets in Julia. Please see this post for more info on what entity resolution is, and the Python package dedupe that performs it.

For smaller datasets, I’ve successfully used StringDistances.jl to simply compare every record in one dataset to every record in the other; this quickly becomes infeasible when both datasets contain hundreds of thousands of rows (a minimal sketch of this brute-force approach follows the list below). I’d like to start framing out a design that leverages some of Julia’s strengths for doing performant entity resolution. The basic workflow for entity resolution typically looks like this (from the article linked to above):

  1. Deduplication: eliminating duplicate (exact) copies of repeated data.
  2. Record linkage: identifying records that reference the same entity across different sources.
  3. Canonicalization: converting data with more than one possible representation into a standard form.
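
To make the scaling problem concrete, here is a minimal sketch of the brute-force approach mentioned above (the function name naive_links and the 0.9 threshold are illustrative, not from any package):

```julia
using StringDistances

# Compare every record in one dataset against every record in the other and
# keep pairs whose similarity clears a threshold. This is O(m*n) string
# comparisons, which is exactly what becomes infeasible at hundreds of
# thousands of rows per table.
function naive_links(a::Vector{String}, b::Vector{String}; threshold = 0.9)
    links = Tuple{Int,Int}[]
    for (i, x) in enumerate(a), (j, y) in enumerate(b)
        # compare returns a similarity in [0, 1], where 1 means identical
        if compare(x, y, Levenshtein()) >= threshold
            push!(links, (i, j))
        end
    end
    return links
end
```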

#2 is obviously the most difficult step in the process and is what would require the most work. I don’t have any expertise in this realm so I’m hoping this thread will be a place to start putting together ideas as to how to do #2 for large datasets. Ideas/thoughts/brain dumps are below:

  • How can we leverage parallelization/GPU computing to carry out this task?

  • TextAnalysis.jl seems to be a fantastic package that already has quite a bit of functionality for handling some of the pre-processing (e.g., string cleaning, tokenization, etc.); see the sketch after this list.

  • Does it make sense to recreate dedupe in Julia (vs. simply using PyCall) or should a package be built from scratch in order to leverage Julia’s strengths?
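
As a taste of that pre-processing, here is a rough sketch using TextAnalysis.jl’s prepare! pipeline (the outputs shown in comments are indicative):

```julia
using TextAnalysis

# Normalise a free-text field before comparing records.
doc = StringDocument("The  Quick,  Brown   FOX!")
prepare!(doc, strip_case | strip_punctuation | strip_whitespace)

text(doc)    # e.g. "the quick brown fox"
tokens(doc)  # e.g. ["the", "quick", "brown", "fox"]
```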

What do you all think? Is there anyone else in the community here that would benefit from being able to do entity resolution in a highly performant manner? Again, I don’t have much expertise in this area but I do have time to invest in it if I can get some guidance.

You may find this talk from last JuliaCon interesting.
It gives a well-motivated and consistent method for doing data cleaning.
If there is a finite list of all canonical forms, then your problems 2/3 are a special case of the data cleaning discussed in the video.

JuliaCon 2019 | Cleaning Messy Data with Julia and Gen | Alex Lew - YouTube.
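
As a toy illustration of that special case (canonicalize is a made-up helper, not from the talk or any package): if the canonical forms are known in advance, each messy value can simply be mapped to its nearest canonical form.

```julia
using StringDistances

# A finite list of canonical forms.
canonical = ["Melbourne", "Sydney", "Brisbane"]

# Map a messy value to the closest canonical form by string similarity.
canonicalize(s) = canonical[argmax([compare(s, c, Levenshtein()) for c in canonical])]

canonicalize("Melborne")  # "Melbourne"
canonicalize("sydny")     # "Sydney"
```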


You should be able to use Levenshtein distance
with a BallTree from NearestNeighbors.jl.
You can’t right now because of: Levenshtein Distance is a true Metric, not a SemiMetric · Issue #19 · matthieugomez/StringDistances.jl · GitHub,
but that should be an easy fix.
You can use a BallTree with any distance measure that is a true metric
(many string distances are only semimetrics).

I assume one of your sets of records is the canonical one?
If so, you put that (after unique) into the ball tree;
if not, you have to put both in (after taking their union).

It is fast to find the k nearest neighbours to a query point within the ball tree.
The query point does not have to be in the tree itself.

See the NearestNeighbors.jl docs.
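
Until that issue is fixed upstream, a wrapper along these lines might do (a sketch only: TrueLevenshtein is a made-up name, and whether BallTree accepts String data points at all is a separate question, as it turns out further down this thread):

```julia
using Distances, StringDistances

# Levenshtein distance satisfies the triangle inequality, so it is safe to
# declare it a true Metric even though StringDistances types it as a SemiMetric.
struct TrueLevenshtein <: Distances.Metric end

(d::TrueLevenshtein)(a::AbstractString, b::AbstractString) =
    evaluate(Levenshtein(), a, b)
```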


Thanks so much for the feedback! I’m not too sure how to go about this, but I’ll explore it and see if I can figure it out. From the NearestNeighbors.jl docs, it looks like I can just create my own Metric type, so I’m going to see if I can code my own version of Levenshtein for this. I’m not really understanding this part, though:

I assume one of your sets of records is the canonical one?
If so, you put that (after unique) into the ball tree

One of the datasets could be thought of as the canonical one and I can strip it down to just unique values, but how would I load that into a tree? Would I be loading a Vector{String} into the tree?

I believe so

For #2 Record linkage, see the new Julia packages SpineBasedRecordLinkage.jl and BayesianRecordLinkage.jl which are under active development.

SpineBasedRecordLinkage was announced previously on Julia Discourse here.


Hi, author of SpineBasedRecordLinkage.jl here.

You can remove duplicate records from a table by linking it to itself and retaining the resulting spine, which is a table in which each record specifies a unique entity from the input table; i.e., a table with duplicate records removed.

If you choose to match each column exactly you will get unique records in the result. More generally, you may choose to match some columns exactly, some approximately (using string distances for example), and some columns may not enter your matching criteria at all. That is, you have flexibility when deciding which columns to deduplicate and how.
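
A package-independent sketch of that matching idea (the column names birthdate and name, the helper names, and the 0.9 threshold are all illustrative, not SpineBasedRecordLinkage.jl’s API):

```julia
using DataFrames, StringDistances

# Two rows refer to the same entity if some columns match exactly
# and others match approximately.
same_entity(r1, r2) =
    r1.birthdate == r2.birthdate &&                   # exact match
    compare(r1.name, r2.name, Levenshtein()) >= 0.9   # approximate match

# Build a spine: one representative row per entity. This all-pairs loop is
# quadratic; a real implementation avoids comparing every pair of rows.
function build_spine(df::DataFrame)
    spine = similar(df, 0)
    for row in eachrow(df)
        if !any(s -> same_entity(row, s), eachrow(spine))
            push!(spine, row)
        end
    end
    return spine
end
```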

For data canonicalization I am working on DataCleaning.jl, which aims to perform various data cleaning operations in a specified order (configured in YAML). Such operations include correcting spelling mistakes and standardising terms according to a user-supplied vocabulary (which includes expanding abbreviations, removing plurals, replacing adjectives with corresponding nouns, etc., as special cases). The package name is a working title and may change. Suggestions welcome!

Also, I intend to leave hooks for probabilistic cleaning/linking methods, though I have no intention of implementing such methods myself. My experience suggests that one can go a long way towards clean data (ready for feature engineering and analysis) using only a sequence of deterministic operations. But it’d be great to enable further cleaning/linking using state of the art probabilistic methods, and I see no reason multiple approaches can’t fit into the same framework.

The ultimate intention is for Schemata.jl, SpineBasedRecordLinkage.jl and DataCleaning.jl to enable people who understand data but who are not necessarily programmers to go from a set of raw tables to a set of clean and linked tables using only YAML configuration and a set of domain-specific vocabularies (e.g., medical terminology in my use case). Note that this does not include feature engineering, as I consider adding new columns to be the next step in the data science pipeline (happy to debate that assumption and test it on use cases other than my own).

Once this is all drafted I intend to have a GitHub repo containing a bunch of self-contained how-tos, which will demonstrate various operations and combinations thereof. The de-duplication described above would be a good example of “How to de-duplicate records in a table”.

Finally, I should mention that these packages are being used where I work, namely the state health department in Victoria, Australia. That is, they are designed to be used by programmers and non-programmers alike (and could benefit from a GUI if anyone’s interested in developing that), and have so far been tested on data sets of around 80 million rows.


@aalexandersson Thanks so much for the info! @jocklawrie this looks fantastic! I’ll give it a go tomorrow and let you know if I run into any issues. I have three datasets right now that I need to compare, each with several hundred thousand records.

This doesn’t seem to work; passing a Vector{String} to BallTree throws:

ERROR: MethodError: no method matching BallTree(::Array{String,1})

You’ll have to check the package’s docs; I haven’t used it in a few years.

Are you still interested in this?