State of the Art for Data Version Control?

Hi folks!

I am working with DrWatson.jl and am wondering: are there any suggested tools or workflows to handle data version control within Julia? I saw that there was DataSets.jl made by @c42f and some ongoing discussions that @Datseris was having. I wasn’t sure what the current state of the art is. Thanks!

~ tcp :deciduous_tree:

Hm, unfortunately there aren’t many options in JuliaLand. I have seen the DataSets.jl package as well, but personally I am not a fan: it’s too complicated for my liking.

In DrWatson there is an attempt to add functionality for attaching metadata, in the sense of attaching a declarative, TOML-style text file to any dataset. This would give you full governance and, in a sense, version control, since the metadata themselves could easily be version controlled. The ongoing attempt is described in this issue and this pull request. It’s called “DrWatsonSim”.
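
To make the sidecar idea concrete, here is a minimal sketch in Python using only the standard library. It is not DrWatson's or DrWatsonSim's API: the function names, the `.meta.toml` suffix, and the metadata keys are all hypothetical, and the tiny writer only handles flat string, integer, and float values with simple keys.

```python
import json
from pathlib import Path


def to_toml(metadata: dict) -> str:
    """Render a flat dict of strings, ints, and floats as TOML key/value lines."""
    lines = []
    for key, value in metadata.items():
        if isinstance(value, str):
            rendered = json.dumps(value)  # double-quoted, escaped string
        else:
            rendered = repr(value)  # this sketch assumes ints and floats only
        lines.append(f"{key} = {rendered}")
    return "\n".join(lines) + "\n"


def attach_metadata(data_file: Path, metadata: dict) -> Path:
    """Write the metadata next to data_file as a small text sidecar (hypothetical naming)."""
    sidecar = data_file.with_name(data_file.name + ".meta.toml")
    sidecar.write_text(to_toml(metadata))
    return sidecar
```

The sidecar is plain text, so it can be committed to git even when the dataset itself is too large to commit.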

The third option I am aware of is CaosDB: it is a scientific database management system that has been test-driven in several large-scale collaborative scientific projects and would allow data management and provenance tracking. However, it only has a barebones Julia implementation that hasn’t been updated in 3 years :frowning: I haven’t used CaosDB in my work because, the way it is now, actually putting data on the CaosDB server would take too much effort. However, I am sure that many very cool things could happen, like automatic integration with DrWatson workflows that puts your data into a database entry by using the configuration container given to produce_or_load. But at the moment the developers do not see Julia as worth allocating their resources to. I hope they change their mind at some point, as Python doesn’t have something as cool as DrWatson…

Do you mean for things larger than the GitHub file size limit, or what? For things under the limit, I simply use GitHub.

Have you considered non-Julia solutions?

If the data is public and not too large, I’d go with git.

For anything else, I’d suggest DVC - it seems quite popular.

You can think of it as git for data. It works well together with git: instead of git tracking the data itself, git tracks only the DVC metadata/references, and DVC itself gets you the actual files in your working tree based on those metadata/references.
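
The pointer idea can be illustrated with a toy sketch in Python (standard library only). This is not DVC's implementation or file format: the `.ptr` suffix, the JSON layout, the cache directory, and the function names are assumptions made up for the illustration. The large file goes into a content-addressed cache, and only the small pointer file would be committed to git.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small pointer file."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cache_dir / digest)
    pointer = data_file.with_name(data_file.name + ".ptr")  # hypothetical suffix
    pointer.write_text(json.dumps({"sha256": digest, "path": data_file.name}))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file named in the pointer from the cache."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(cache_dir / meta["sha256"], target)
    return target


def demo() -> bytes:
    """Track a file, delete it, then restore it from the pointer alone."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "measurements.csv"
        data.write_bytes(b"t,x\n0,1.0\n1,1.5\n")
        pointer = track(data, root / "cache")
        data.unlink()  # simulate a fresh clone that only has the pointer
        restored = checkout(pointer, root / "cache")
        return restored.read_bytes()
```

In the real tool the cache would typically be synced with remote storage, so collaborators who pull the git repository can fetch the matching data separately.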

Hey @Datseris :wave: thanks for your thoughts!

Yeah, I got a sense of it being complicated as well, but I haven’t tried it fully yet. Perhaps I am just not reading how to use it correctly.

I owe you a big thank you, @svilupp! DVC is the ticket for me and is perfect for my use case. Thanks! Having worked with it some, I have to admit it is a little cumbersome to use, so I will have to keep practicing with it!

DVC is great if you’re comfortable with git, but Weights & Biases is a much more user-friendly alternative IMO.
