[ANN] SpineBasedRecordLinkage.jl

Hi all,

Announcing SpineBasedRecordLinkage.jl, which links a set of tables to a central table known as a spine. If the spine doesn’t already exist, the package will construct it from the input tables.

The package provides 3 functions:

  • run_linkage is used to construct a spine from one or more tables and link the tables to the spine. Alternatively, an existing spine can be passed and run_linkage will only perform the linkage step. A linkage run is configured in a YAML file and can run as a script, so that users needn’t write any Julia code.
  • summarise_linkage_run provides a summary report of the results of a linkage run as a CSV file.
  • compare_linkage_runs provides a summary comparison of 2 linkage runs as a CSV file.

The results can also be interrogated at the record level. That is, we get a record-level audit trail so that we can see, for each link, what criteria were satisfied to enable the link.
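
To make the idea of a spine and a record-level audit trail concrete, here is a toy sketch in plain Base Julia. It is not the package's API: the data, the `criteria` list and the `link_to_spine` function are all made up for illustration, and I'm assuming exact matching with criteria tried in order, which may differ from what the package actually does.

```julia
# Toy illustration only: plain Base Julia, NOT the SpineBasedRecordLinkage.jl API.
# All names (spine, admissions, criteria, link_to_spine) are hypothetical.

spine = [
    (spineid = 1, firstname = "ann", lastname = "lee",   dob = "1980-01-02"),
    (spineid = 2, firstname = "bob", lastname = "stone", dob = "1975-06-30"),
]

admissions = [
    (recordid = 101, firstname = "ann", lastname = "lee",   dob = "1980-01-02"),
    (recordid = 102, firstname = "rob", lastname = "stone", dob = "1975-06-30"),
    (recordid = 103, firstname = "cat", lastname = "young", dob = "1990-12-12"),
]

# Criteria are tried in order, strictest first.
# Each criterion lists the fields that must match exactly (an assumption of this sketch).
criteria = [
    (:firstname, :lastname, :dob),
    (:lastname, :dob),
]

# True if the record and the spine record agree on every field of the criterion.
satisfies(record, spinerecord, fields) =
    all(getproperty(record, f) == getproperty(spinerecord, f) for f in fields)

# Returns one row per link: which record, which spine record, and which criterion was satisfied.
function link_to_spine(records, spine, criteria)
    links = NamedTuple{(:recordid, :spineid, :criterionid), Tuple{Int, Int, Int}}[]
    for record in records
        for (criterionid, fields) in enumerate(criteria)
            idx = findfirst(s -> satisfies(record, s, fields), spine)
            idx === nothing && continue   # no match under this criterion, try the next one
            push!(links, (recordid = record.recordid, spineid = spine[idx].spineid, criterionid = criterionid))
            break                         # stop at the first criterion that yields a link
        end
    end
    links
end

links = link_to_spine(admissions, spine, criteria)
```

Each row of `links` is the audit trail for one link: the `criterionid` column says which criterion enabled it. Records with no row (like 103 here) stayed unlinked.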

The next step is to implement a function that consolidates each spine record (where possible) using data from the records linked to it, together with some user-defined consolidation rules (again configured in a YAML file). Running the link and consolidation steps iteratively (link-consolidate-link-consolidate…) ought to yield a fairly stable, robust spine with linked data and an audit trail.
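
As a rough picture of what a consolidation step could look like, here is another toy sketch in plain Base Julia. Nothing here exists in the package yet: the rule (keep the spine value if present, otherwise take the first non-missing value from the linked records) and the names `fill_missing` and `consolidate` are purely hypothetical.

```julia
# Toy illustration only: the consolidation rule and all names are hypothetical.

spinerecord = (spineid = 2, firstname = "bob", lastname = "stone", postcode = missing)

# Records already linked to this spine record.
linked = [
    (recordid = 102, firstname = "rob", lastname = "stone", postcode = missing),
    (recordid = 207, firstname = "bob", lastname = "stone", postcode = "3000"),
]

# Hypothetical rule: keep the spine value if present,
# otherwise take the first non-missing value among the linked records.
function fill_missing(spinevalue, linkedvalues)
    !ismissing(spinevalue) && return spinevalue
    idx = findfirst(!ismissing, linkedvalues)
    idx === nothing ? missing : linkedvalues[idx]
end

# Apply the rule to each listed field and return an updated copy of the spine record.
function consolidate(spinerecord, linked, fields)
    newvalues = map(fields) do f
        fill_missing(getproperty(spinerecord, f), [getproperty(r, f) for r in linked])
    end
    merge(spinerecord, NamedTuple{fields}(newvalues))
end

consolidated = consolidate(spinerecord, linked, (:postcode,))
```

With the postcode filled in, a later linkage pass could use it as an extra matching field, which is the point of alternating the two steps.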

I am currently using this package on my commodity desktop with public health data totalling around 80 million rows so far. It’s working pretty well and takes under 15 minutes for a linkage run.

Feedback is always welcome.

Happy coding!



Awesome!!! I was looking for something like this not too long ago!