[ANN] DataToolkit.jl — Reproducible, flexible, and convenient data management



DataToolkit.jl is part of a trio of packages that aim to provide an exceptionally convenient and extensible approach to data management. Compared to pipelines (e.g. nextflow, snakemake) this focuses more on helping when you want to make the data handling in an individual project/one-shot analysis easy and reproducible.

It subsumes the capabilities of (a non-exhaustive list):

Here’s a taster of what using it looks like:

(screenshot: an example DataToolkit session)

See the Introduction in the docs and/or my JuliaCon23 presentation [slides] to learn more :slight_smile:

I’ve just released v0.7 :tada:, and pending development/feedback I plan to tag 1.0 late this year. A few things I want to do before then:

  • Change the default checksum to KangarooTwelve/BLAKE3
  • Settle on a public API
  • Write more docs
  • Add more tests
  • Support more storage backends (artifacts, S3)

If this sounds of interest, please give it a whirl and let me know if you have any feedback; I’d be keen to hear thoughts/experiences with this :grinning:.

44 Likes

This looks very interesting! Could you give a quick summary of the similarities/differences compared to DrWatson.jl?

4 Likes

To quote George:

the mission with DrWatson is to keep things as simple as possible. And DrWatson is not so much a data management system, but a “scientific project assistant” let’s say.

So it’s a bit like being asked to compare DataFrames.jl and MLJ.jl — both “do things” with tables but have entirely different concerns.

That said, this could probably be used in a DrWatson project, and maybe even integrated more directly.

1 Like

How does this system deal with the versions of the loader packages? It seems like when CSV is needed, you’re prompted to install it in the usual way, giving you a Manifest entry. But the Data.toml doesn’t know about the version, so it alone would not be enough to ensure you get the same data on a different machine after deserialization. I guess checksumming can only apply to the raw bytes, so how can a user be sure that data is loaded as originally intended? Or is the system intended to work together with your manifest, so that only the combination of Data.toml plus Manifest.toml suffices to describe what you did?

1 Like

Exactly what you suppose at the end. The Data.toml is expected to live within a Julia project (next to the Project.toml and Manifest.toml). We let Pkg.jl do the excellent job it already does of reproducing package environments, and together with DataToolkit you get complete reproducibility of a project that involves loading data :slight_smile:
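To make that concrete, here is a rough sketch of the project layout and Data.toml this implies. The keys and values below are illustrative rather than authoritative (the URL and UUIDs are placeholders); see the DataToolkit docs for the actual schema:

```toml
# MyProject/
# ├── Project.toml    # direct dependencies (managed by Pkg.jl)
# ├── Manifest.toml   # exact resolved package versions (managed by Pkg.jl)
# └── Data.toml       # data set declarations (managed by DataToolkit)

data_config_version = 0
uuid = "00000000-0000-0000-0000-000000000000"  # placeholder collection UUID
name = "myproject"

# One data set, fetched from the web and parsed as CSV
[[iris]]
uuid = "00000000-0000-0000-0000-000000000001"  # placeholder data set UUID
description = "Fisher's iris data set"

    [[iris.storage]]
    driver = "web"
    url = "https://example.com/iris.csv"  # illustrative URL

    [[iris.loader]]
    driver = "csv"
```

The idea is that the Manifest.toml pins the loader package versions while the Data.toml pins the data itself, so the pair together describes the whole loading pipeline.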

Hmm, maybe it would be good to have an example repo showing a data project and a data package built with DataToolkit :thinking:. For the first of these, I could just save the end result of the tutorial.

3 Likes

Ok so it’s less of a standalone data description than I first thought. But I guess that makes sense, given the many options data loading packages expose and which can be subject to change.

1 Like

The only truly “standalone” way of handling data is a tool that never changes and just does basic downloads and maybe a checksum on a pile of bytes.

If you want to go beyond this (as DataToolkit tries to), you need to grapple with the fact that the data becomes intertwined with the way it is loaded/processed. Some tools (e.g. DataLad) let you make this reproducible by containerising the processing step(s), but if we indulge ourselves and make a Julia-specific tool we can just use Pkg.jl :smiley:.

It’s also worth noting that the version of DataToolkit (+ dependencies) will itself also be managed by Pkg.jl of course, and in this way everything is (IMO) rather nicely bundled up.

In the future, it could be nice to have a way to concatenate {Project,Manifest,Data}.toml into a single file that would be as portable as you seem to be thinking of. That said, .zip and .tar files already exist…

All that said, barring drastic changes to the loaders used (e.g. CSV.jl), the Data.toml files should be rather portable, and if you want to go a step further you can create a package that has a Data.toml alongside its Project.toml. That’s a pretty good way of sharing datasets, since you can then just do `using SomeDataSets` and then … use them.
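As a sketch of that last route (the package and data set names here are hypothetical, and `d"..."` is DataToolkit’s data string macro; the exact incantation may differ between versions):

```julia
# Hypothetical data package: SomeDataSets ships a Data.toml next to its
# Project.toml, and loading the package makes the declared data sets available.
using SomeDataSets
using DataToolkit

iris = d"iris"  # load the data set named "iris" declared in SomeDataSets' Data.toml
```

Because the package carries its own Project.toml/Manifest.toml, Pkg.jl pins the loader versions for consumers of the data sets too.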

5 Likes

I have a few datasets that I use across multiple projects and, depending on the project, I only ever need to load a subset of the available columns. When loading a dataset with DataToolkit.jl, is there a way to tell it which columns of a tabular dataset I want to load? Something like the `select` keyword argument from CSV.jl?

Not out of the box, but there are two ways you could do so:

  • Via derived data sets (e.g. create `a-1`, which is just column 1 of `a`)
  • Via a plugin that adds that behaviour (plugins can do a lot)

1 Like

Great work! I really like the data REPL - very creative. The JuliaCon talk is great too, easy to follow - the part at 10:00 is awesome.

DrWatson.jl is more about creating well-organized, reproducible environments for performing data analysis. It builds on the package manager and uses a custom environment to ensure all calculations can be replicated in a consistent package environment. DataDeps.jl can be used to import the data into a DrWatson project. This can be very convenient when the data is bulky and/or must be stored according to a journal’s or your organization’s requirements. As a government scientist, I publish the data in my organization’s repository (made findable through data.gov) and then publish a DrWatson project on GitHub to process the data.

They are both super useful.

5 Likes

I agree. This sounds very promising. I missed this announcement post, but I’ll try to read it in more detail and give more feedback. I’m also happy to help with better/simpler integration between the two packages!

5 Likes