State of the Art for Data Version Control?

TheCedarPrince · November 14, 2022, 4:17pm

Hi folks!

I am working with DrWatson.jl and am wondering: are there any suggested tools or workflows to handle data version control within Julia? I saw that there was DataSets.jl made by @c42f and some ongoing discussions that @Datseris was having. Wasn’t sure what the current state of the art is. Thanks!

~ tcp

Datseris · November 15, 2022, 9:14am

Hm, unfortunately there aren’t many options in JuliaLand. I have seen the DataSets.jl package as well, but personally I am not a fan: it’s too complicated for my liking.

In DrWatson there is an attempt to bring a metadata-attaching functionality, in the sense of attaching a declarative “toml” style text file to any dataset. This would give you full governance and, in a sense, version control, since the metadata themselves could easily be version controlled. The ongoing attempt is described in this issue and this pull request. It’s called “DrWatsonSim”.
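To make the sidecar idea concrete, here is a minimal sketch in Python. It is not DrWatsonSim's actual format or API: the function name `write_sidecar`, the `.meta.toml` suffix, and the field names are all hypothetical, and it assumes plain keys with string or number values.

```python
import hashlib
import json
from pathlib import Path


def write_sidecar(data_path: Path, params: dict) -> Path:
    """Write a TOML-style metadata file next to a dataset (hypothetical layout)."""
    # A content hash ties the metadata to one exact version of the data file.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    lines = [
        f"file = {json.dumps(data_path.name)}",
        f"sha256 = {json.dumps(digest)}",
        "",
        "[parameters]",
    ]
    # Assumption: json.dumps output for simple strings and numbers is also valid TOML.
    for key, value in sorted(params.items()):
        lines.append(f"{key} = {json.dumps(value)}")
    sidecar = data_path.with_name(data_path.name + ".meta.toml")
    sidecar.write_text("\n".join(lines) + "\n")
    return sidecar
```

The resulting text file is small enough to commit to git even when the dataset itself is not, which is what makes the metadata easy to version control.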

The third option I am aware of is CaosDB: it is a scientific database management system that has been test driven in several large scale collaborative scientific projects and would allow data management and provenance. However, it only has a barebones Julia implementation that hasn’t been updated in 3 years. I haven’t used CaosDB in my work because, the way it is now, actually putting data on the CaosDB server would take too much effort. However, I am sure that there could be many very cool things happening, like automatic integration with DrWatson workflows to put your data automatically in a database entry by using the configuration container given in produce_or_load. But at the moment the developers do not see Julia as worth allocating their resources to. I hope they change their mind at some point, as Python doesn’t have something as cool as DrWatson…

tbeason · November 15, 2022, 1:16pm

Do you mean for things larger than the github file size limit or what? For things under the limit, I simply use github.

svilupp · November 16, 2022, 6:41am

Have you considered non-Julia solutions?

If the data is public and not too large, I’d go with git.

For anything else, I’d suggest DVC - it seems quite popular.

You can think of it as git for data. It works well together with git. But instead of git tracking the data itself, git tracks only the dvc metadata/references and dvc itself gets you the actual files based on the metadata/references in your working tree.
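The pointer-file idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how DVC is implemented: the names `track` and `checkout`, the `.ptr.json` suffix, and the use of SHA-256 are assumptions made for this example, and DVC's real `.dvc` files and cache layout differ.

```python
import hashlib
import json
import shutil
from pathlib import Path


def track(data_path: Path, cache_dir: Path) -> Path:
    """Copy a data file into a content-addressed cache and write a small pointer file."""
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache entry is named after the content hash, so each version is stored once.
    shutil.copyfile(data_path, cache_dir / digest)
    pointer = data_path.with_name(data_path.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": data_path.name, "sha256": digest}, indent=2))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file named in a pointer file from the cache."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(cache_dir / meta["sha256"], target)
    return target
```

In this picture, only the small pointer file would be committed to git, while the cache (or a remote copy of it) holds the actual data. Checking out an older commit brings back an older pointer, and `checkout` then restores the matching data.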

TheCedarPrince · November 29, 2022, 8:40pm

Hey @Datseris thanks for your thoughts!

Yea, I got a sense of it being complicated as well but I haven’t tried it fully yet. Perhaps I am just not reading how to use it correctly.

I owe you a great thank you @svilupp ! DVC is the ticket for me and is perfect for my use case. Thanks! Having worked with it some, have to admit, it is a little cumbersome using the tool – will have to keep practicing with it!

cjdoris · November 30, 2022, 9:52pm

DVC is great if you’re comfortable with git, but Weights & Biases is a much more user friendly alternative IMO.
