Announcing SpineBasedRecordLinkage.jl, which links a set of tables to a central table known as a spine. If the spine doesn’t already exist, the package will construct it from the input tables.
The package provides 3 functions:
run_linkage is used to construct a spine from one or more tables and link the tables to the spine. Alternatively, an existing spine can be passed, in which case run_linkage will only perform the linkage step. A linkage run is configured in a YAML file and can be run as a script, so that users needn’t write any Julia code.
summarise_linkage_run provides a summary report of the results of a linkage run as a CSV file.
compare_linkage_runs provides a summary comparison of 2 linkage runs as a CSV file.
The results can also be interrogated at the record level. That is, we get a record-level audit trail so that we can see, for each link, what criteria were satisfied to enable the link.
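To make the idea of a record-level audit trail concrete, here is a minimal sketch in plain Julia. It is not the package's implementation or API: the function `link_to_spine`, the helper `key`, the column names and the rule that a link requires exactly one matching spine record are all assumptions made for illustration.

```julia
# Hypothetical illustration only: none of these names come from the package.

# A toy spine and a toy input table, each a vector of NamedTuples.
spine = [
    (spineid = 1, firstname = "ann", surname = "lee", dob = "1980-01-02"),
    (spineid = 2, firstname = "bob", surname = "ray", dob = "1975-06-30"),
]
table = [
    (recordid = 101, firstname = "ann", surname = "lee", dob = "1980-01-02"),
    (recordid = 102, firstname = "rob", surname = "ray", dob = "1975-06-30"),
    (recordid = 103, firstname = "cat", surname = "poe", dob = "1990-12-12"),
]

# Criteria are tried in order; each lists the columns that must match exactly.
criteria = [
    (:firstname, :surname, :dob),
    (:surname, :dob),
]

# Extract the values of the given columns from a row as a tuple.
key(row, cols) = map(c -> getproperty(row, c), cols)

function link_to_spine(spine, table, criteria)
    audit = NamedTuple[]
    for row in table
        for (criterionid, cols) in enumerate(criteria)
            candidates = [s for s in spine if key(s, cols) == key(row, cols)]
            # Assumption: accept a link only if exactly one spine record matches.
            if length(candidates) == 1
                push!(audit, (recordid = row.recordid,
                              spineid = candidates[1].spineid,
                              criterionid = criterionid))
                break  # stop at the first criterion that yields a link
            end
        end
    end
    return audit
end

audit = link_to_spine(spine, table, criteria)
```

Each row of `audit` records which input record was linked to which spine record and which criterion was satisfied, which is the kind of information the audit trail exposes. Records that satisfy no criterion simply do not appear.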
The next step is to implement a function that consolidates each spine record (where possible) using data from the set of records that are linked to it, together with some user-defined consolidation rules (again configured in a YAML file). Using the link and consolidation steps iteratively (link-consolidate-link-consolidate…) ought to yield a fairly stable and robust spine with linked data and an audit trail.
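As a rough picture of what a consolidation step might look like, here is a sketch in plain Julia. Since this function is not yet implemented, everything here is hypothetical: the names `consolidate` and `most_frequent`, and the rule itself (take the most frequent non-missing value among the linked records, otherwise keep the spine's current value), are assumptions chosen only to illustrate the idea.

```julia
# Hypothetical illustration only: not the package's planned design.

# Rule: return the most frequent non-missing value, or `fallback` if there is none.
# Ties go to the value that reaches the highest count first.
function most_frequent(values, fallback)
    counts = Dict{Any, Int}()
    best, bestcount = fallback, 0
    for v in skipmissing(values)
        counts[v] = get(counts, v, 0) + 1
        if counts[v] > bestcount
            best, bestcount = v, counts[v]
        end
    end
    return best
end

# Apply the rule to each of the given columns of a spine record.
function consolidate(spine_record, linked, cols)
    newvalues = map(cols) do c
        most_frequent([getproperty(r, c) for r in linked], getproperty(spine_record, c))
    end
    return merge(spine_record, NamedTuple{cols}(newvalues))
end

spine_record = (spineid = 2, firstname = "bob", surname = "ray")
linked = [
    (firstname = "rob", surname = "ray"),
    (firstname = "rob", surname = "ray"),
    (firstname = missing, surname = "rae"),
]

consolidated = consolidate(spine_record, linked, (:firstname, :surname))
```

In this toy case the spine's first name is updated to the value most of the linked records agree on, while the surname is unchanged. A subsequent linkage pass would then match against the consolidated record.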
I am currently using this package on my commodity desktop with public health data totalling around 80 million rows so far. It’s working pretty well and takes under 15 minutes for a linkage run.
Feedback always welcome.