I think it would be a good idea to start brainstorming a framework for doing entity resolution across datasets in Julia. Please see this post for more info on what entity resolution is, and the Python package dedupe that performs it.
For smaller datasets, I’ve successfully used StringDistances.jl to simply compare every record in one dataset to every record in the other. That brute-force approach costs O(n × m) comparisons, so it quickly becomes infeasible when both datasets contain hundreds of thousands of rows. I’d like to start framing out a design that would leverage some of Julia’s strengths for doing performant entity resolution. The basic workflow for entity resolution (from the article linked above) typically looks like this:
- Deduplication: eliminating duplicate (exact) copies of repeated data.
- Record linkage: identifying records that reference the same entity across different sources.
- Canonicalization: converting data with more than one possible representation into a standard form.
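Step 3 is the easiest to make concrete. A minimal canonicalization pass in base Julia might look like the sketch below; the specific normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are just illustrative assumptions, not a standard:

```julia
# Minimal canonicalization sketch: map each raw string to one standard form.
# The normalization rules here are illustrative assumptions.
function canonicalize(s::AbstractString)
    t = lowercase(strip(s))
    t = replace(t, r"[[:punct:]]" => "")   # drop punctuation
    t = replace(t, r"\s+" => " ")          # collapse runs of whitespace
    return t
end

canonicalize("  Smith,  John ")  # => "smith john"
```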
#2 is obviously the most difficult step in the process and would require the most work. I don’t have any expertise in this realm, so I’m hoping this thread can be a place to start putting together ideas on how to do #2 for large datasets. Ideas/thoughts/brain dumps are below:
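One standard way to make #2 tractable (and what dedupe and most record-linkage systems do) is blocking: only compare records that share a cheap candidate key, rather than scoring all n × m pairs. A minimal sketch in base Julia; the blocking key used here (first letter) is purely an illustrative assumption, and real systems use phonetic codes, n-grams, or learned predicates:

```julia
# Blocking sketch: group records by a cheap key so only records sharing
# a key get compared, instead of all n*m pairs.
# The key function (first letter) is an illustrative assumption.
blockkey(s::AbstractString) = isempty(s) ? "" : lowercase(first(s, 1))

function candidate_pairs(a::Vector{String}, b::Vector{String})
    blocks = Dict{String,Vector{Int}}()
    for (j, s) in enumerate(b)
        push!(get!(blocks, blockkey(s), Int[]), j)
    end
    pairs = Tuple{Int,Int}[]
    for (i, s) in enumerate(a)
        for j in get(blocks, blockkey(s), Int[])
            push!(pairs, (i, j))
        end
    end
    return pairs
end

a = ["alice", "bob"]
b = ["alicia", "robert", "bobby"]
candidate_pairs(a, b)  # => [(1, 1), (2, 3)] — only same-first-letter pairs
```

The surviving candidate pairs would then be scored with a string distance (e.g., from StringDistances.jl), which is where the real modeling work lives.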
How can we leverage parallelization/GPU computing to carry out this task?
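On the parallelization question: the pairwise scoring step is embarrassingly parallel, so plain multithreading already helps before reaching for a GPU. A sketch using Base.Threads; the `score` function is a toy stand-in assumption for whatever string metric is actually used:

```julia
using Base.Threads

# Toy stand-in for a real similarity score (e.g., from StringDistances.jl).
score(x::String, y::String) = x == y ? 1.0 : 0.0

# Score candidate pairs in parallel; each iteration writes a distinct
# slot of `out`, so no locking is needed.
function score_pairs(a::Vector{String}, b::Vector{String},
                     pairs::Vector{Tuple{Int,Int}})
    out = Vector{Float64}(undef, length(pairs))
    @threads for k in eachindex(pairs)
        i, j = pairs[k]
        out[k] = score(a[i], b[j])
    end
    return out
end
```

Run with `julia -t auto` to use all cores; because the pairs are independent, the same pattern should map onto Distributed.jl or GPU kernels without restructuring.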
TextAnalysis.jl seems to be a fantastic package that already has quite a bit of functionality for handling some of the pre-processing (e.g., string cleaning, tokenization, etc.).
Does it make sense to recreate dedupe in Julia (vs. simply calling it via PyCall), or should a package be built from scratch in order to leverage Julia’s strengths?
What do you all think? Is there anyone else in the community who would benefit from being able to do entity resolution in a highly performant manner? Again, I don’t have much expertise in this area, but I do have time to invest in it if I can get some guidance.