Data storage/loading for data produced by algorithms and metadata

Hi,
I am sorry if this question has already been answered somewhere else, or if there is a package made for exactly this need, but the more I searched for solutions to my problem, the more confused I got.

I have algorithms that produce multiple (tabular) data sets, and I want to store these in a human-readable format together with metadata, which mostly consists of the parameters of the models and algorithms. All of this should be stored together in one file. Data sizes are “small” (< 1 GB). The metadata should be editable after creation.

What would you suggest in this case? What packages should I consider using? I guess, since this is a problem most computational scientists face, there are some good solutions out there :slight_smile:

Thanks in advance for all recommendations/responses!

For this task, I use DVC (Data Version Control), a language-agnostic tool for managing data dependencies and pipelines. It requires a bit of discipline, but the idea is that “metadata” is a broader concept than what you might typically consider (e.g. parameters in a JSON or txt file): it also covers the scripts and environment that produce the final data, which you can track with DVC by including the script files and Manifest.toml as dependencies. You can then build a pipeline on the output data by including it as a dependency in a subsequent stage. Changes to your metadata and resulting outputs are tracked in your git history.

Reproducibility and the ability to integrate with cloud storage are very important to me, so I accept the overhead that something like DVC adds. Because you’re working with smaller files and may need to be more nimble than I do, I would recommend looking at DrWatson.jl. Although it is not a data management system like DVC or others, for some use cases it can certainly help automate the tracking and saving of simulations, which may be all you need.

Also, see the DrWatson.jl announcement for further discussion of these kinds of tools.
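To make the DrWatson suggestion more concrete, here is a minimal sketch of saving one run together with its parameters. It assumes DrWatson and JLD2 are both installed; the parameter names, the table contents, the `sim` prefix, and the temporary output directory are all made up for illustration.

```julia
using DrWatson  # assumes JLD2 is also installed so that ".jld2" files can be written

# Hypothetical parameters and results, purely for illustration.
params = Dict(:alpha => 0.5, :n => 100)
table  = Dict("x" => [1, 2, 3], "y" => [0.1, 0.4, 0.9])

# Build a file name that encodes the parameters.
filename = savename("sim", params, "jld2")
path = joinpath(mktempdir(), filename)

# Store parameters and data together in one file; wsave creates missing directories.
wsave(path, Dict("params" => params, "table" => table))

# Load everything back as a dictionary.
loaded = wload(path)
```

In a real DrWatson project you would probably build the path with `datadir(...)` instead of a temporary directory. Note that JLD2 files are not human-readable.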


Thanks @platawiec for these recommendations! I do not need to track different “versions” of algorithms/packages yet, and I think I can stick to metadata just being parameters for now. If this ever changes, I will definitely have a closer look at https://dvc.org/.

The DrWatson package looks very promising to me. I will give this a try!

Hey @platawiec,

so I am playing around with DrWatson a bit, and I like the idea of being able to choose the backend for saving. Which backend would you recommend if I want to save a small amount of metadata and some tabular data into one file, which ideally should be human-readable as well? :slight_smile:

How about the HDF5 format, which is a very popular scientific data format?
https://github.com/JuliaIO/HDF5.jl

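Write and read metadata as attributes using

using HDF5
h5write("bar.h5", "foo", [1.0 2.0; 3.0 4.0])  # example data: the object "foo" must exist before attributes can be attached
h5writeattr("bar.h5", "foo", Dict("c" => "value for metadata parameter c", "d" => "metadata d"))
h5readattr("bar.h5", "foo")  # returns the attributes as a Dict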
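For several tables plus parameters in one file, the same idea can be sketched with the lower-level `h5open` interface. This is only an illustration and assumes HDF5.jl is installed; the group name `run1`, the attribute names, and the table values are hypothetical.

```julia
using HDF5

path = joinpath(mktempdir(), "results.h5")
table = [1.0 0.9; 2.0 0.5; 3.0 0.25]   # hypothetical columns: iteration, loss

# Create the file with one group per run; parameters go into the group's attributes.
h5open(path, "w") do file
    g = create_group(file, "run1")
    g["table"] = table
    attributes(g)["algorithm"] = "gradient_descent"
    attributes(g)["step_size"] = 0.01
end

# Metadata can be extended later by reopening the file in read-write mode.
h5open(path, "r+") do file
    attributes(file["run1"])["comment"] = "baseline run"
end

# Read the data and the metadata back.
table_back, step_size, comment = h5open(path, "r") do file
    g = file["run1"]
    (read(g["table"]), read(attributes(g)["step_size"]), read(attributes(g)["comment"]))
end
```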

OK, it is not human-readable, but you should be able to convert it to a human-readable format easily.
This is at a lower level than DVC or DrWatson, but it may be useful to you.
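If human readability in a single file matters most, one possible alternative for small data is a plain-text format such as TOML, which ships with Julia as a standard library (Julia 1.6 and later). The sketch below is only one way to do it. The layout, with a `metadata` table and a `data` table holding one array per column, is an assumption made for illustration, and all names and values are hypothetical.

```julia
using TOML

# Hypothetical parameters and tabular data (one array per column).
metadata = Dict("algorithm" => "gradient_descent", "step_size" => 0.01, "iterations" => 3)
columns  = Dict("iteration" => [1, 2, 3], "loss" => [0.9, 0.5, 0.25])

path = joinpath(mktempdir(), "result.toml")

# Write metadata and data into one human-readable file.
open(path, "w") do io
    TOML.print(io, Dict("metadata" => metadata, "data" => columns); sorted = true)
end

# Change the metadata after creation: parse, modify, write back.
result = TOML.parsefile(path)
result["metadata"]["comment"] = "rerun with smaller step size"
open(path, "w") do io
    TOML.print(io, result; sorted = true)
end

reloaded = TOML.parsefile(path)
```

The resulting file can also be opened and edited in any text editor. A text format like this becomes slow and bulky for large tables, so this approach is best kept for small data.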