Data storage/loading for data produced by algorithms and metadata

For this task, I use DVC (Data Version Control), a language-agnostic tool for managing data dependencies and pipelines. It requires a bit of discipline, but the idea is that “metadata” is a more general concept than what you might typically consider (e.g., parameters in a JSON or txt file): it also covers the scripts and environment that produce the final data, which you can track with DVC by listing those files and your Manifest.toml as stage dependencies. You can then build a pipeline on the output data by including it as a dependency of a subsequent stage. Changes to your metadata and the resulting outputs are tracked in your git history.
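
As a minimal sketch, a two-stage pipeline along those lines might look like the `dvc.yaml` below (the script names, `params.json`, and output paths are hypothetical placeholders; run it with `dvc repro`):

```yaml
stages:
  simulate:
    cmd: julia --project=. scripts/run_sim.jl
    deps:
      - scripts/run_sim.jl   # the script itself is "metadata"
      - params.json          # parameter file
      - Manifest.toml        # pins the Julia environment
    outs:
      - data/sim_output.csv
  postprocess:
    cmd: julia --project=. scripts/postprocess.jl
    deps:
      - scripts/postprocess.jl
      - data/sim_output.csv  # output of the first stage as a dependency
    outs:
      - data/final_results.csv
```

If any dependency changes, `dvc repro` reruns only the affected stages, and the updated `.dvc`/`dvc.lock` files are committed to git.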

Reproducibility and integration with cloud storage are essential for me, so I accept the overhead that a tool like DVC introduces. Because you are working with smaller files and may need to be more nimble than I am, I would recommend looking at DrWatson.jl. It is not a data management system like DVC, but for some use cases it can certainly help automate the tracking and saving of simulations, which may be all you need.
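
For illustration, here is a minimal sketch of that workflow (the project name, parameter values, and `run_simulation` function are hypothetical; JLD2 is assumed to be in the project environment):

```julia
using DrWatson
@quickactivate "MyProject"   # hypothetical project created with initialize_project

# Hypothetical simulation parameters
params = Dict(:alpha => 0.1, :n => 1000)

# savename encodes the parameters in the filename, e.g. "alpha=0.1_n=1000.jld2"
file = datadir("simulations", savename(params, "jld2"))

# run_simulation is a stand-in for your actual simulation code
result = Dict("output" => run_simulation(; params...))

# tagsave writes the result and records the current git commit alongside it
tagsave(file, result)
```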

Also, see the DrWatson.jl announcement for further discussion of these kinds of tools.
