Exploratory research project workflow

I find DVC easy to use because it’s built on top of git and agnostic to your programming language, tools, etc. If you are currently running scripts which read some files and write others, and that chain of computations can be described by a DAG, then it will work for you.
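For example, wiring a single script into a tracked stage looks roughly like this (a sketch; file and stage names are made up, and on newer DVC versions `dvc run` is spelled `dvc stage add`):

```
dvc add data/raw                      # put the input under dvc tracking
dvc run -n process \
        -d process.jl -d data/raw \
        -o data/processed \
        julia process.jl              # dvc records the command, deps, and outputs
```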

How you structure experiments (by git branches, by commits, or just in the file structure) is up to you, although DVC provides some tooling for tracking metrics and comparing branches/experiments. 90% of use cases are covered just by pull/push/add/run/repro, much as 90% of version-control needs are covered by a few git commands.
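A typical loop with just those commands looks something like this (a sketch; assumes a dvc remote is already configured):

```
dvc pull                 # fetch the data/artifacts matching the current git checkout
# ...edit a script or replace an input file...
dvc repro                # re-run only the stages whose dependencies changed
git add dvc.lock; git commit -m "re-run pipeline"
dvc push                 # upload the regenerated artifacts to the remote cache
```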

It contrasts with other ML pipeline tools in being less opinionated and less tied to a particular workflow or programming language. That was the case when I surveyed the options about a year ago; other tools may have developed since then. I think some people even use dvc in conjunction with other workflow tools, because it is relatively non-intrusive.

I don’t find any strongly negative aspects to it, but it does keep you honest about the dependencies you mark and changes to those dependencies. Some people also find the one-experiment-per-branch idea too heavy, but if you’re precise about what an “experiment” is, it’s not a problem.


Indeed, dvc looks great and easy to use.
I read the documentation and watched the tutorials yesterday, and I noted a few potential issues, though there may be simple solutions to them:

  • it is not clear how to deal with advanced plots produced as results of a script; should one treat them like models and data tracked by dvc?
  • I had the impression that dvc assumes one pipeline per project. I wonder how one deals with several (separate folders, each with its own dvc init? how does one navigate between them?)
  • it seems that repro and the pipeline require starting Julia for every step, while with Julia it is often better to keep a session running.

@platawiec, could you please share your experience?

Sorry, some of my questions were too simple.

One idea from CML is to reproduce the plots in the push (CI) pipeline.
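Something like this GitHub Actions sketch, perhaps (untested; the image tag, report file, and remote setup are all assumptions on my part):

```yaml
# .github/workflows/repro.yml (rough sketch; assumes a configured dvc remote)
name: reproduce-on-push
on: [push]
jobs:
  repro:
    runs-on: ubuntu-latest
    container: dvcorg/cml:latest   # CML image with dvc preinstalled
    steps:
      - uses: actions/checkout@v2
      - env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          dvc pull                      # fetch inputs from the remote cache
          dvc repro                     # re-run stale stages, regenerating plots
          dvc push                      # store the fresh outputs
          cml-send-comment report.md    # post a report back to the commit
```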

dvc repro takes the name of the pipeline as an argument, so it scales to several pipelines per project.
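For instance (stage and path names here are hypothetical):

```
dvc repro                        # reproduce the default pipeline in ./dvc.yaml
dvc repro plot                   # only the stage named "plot" (plus stale upstream stages)
dvc repro subproject/dvc.yaml    # a separate pipeline kept in its own folder
```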

I would love to see some examples already working with Julia. Is there anything public?

If you have time, could you also give a bit of a high-level description?

I’m looking at a tutorial too and it seems like it’s just Git but with metadata files pointing to the data. Then, I don’t understand why the DAG is needed.

Ah, because DVC assumes the files to be too big to diff, so you cannot merge branches back into the main branch.

Yes, this is more of a script-based workflow for long computations that need to be tracked as part of a report or other deliverable. For rapid prototyping I use a Pluto notebook to try out new ideas/plots etc., then migrate the code into the dvc pipeline as it stabilizes.

Something like DaemonMode.jl (https://github.com/dmolina/DaemonMode.jl, a client-daemon workflow for running Julia scripts faster) would probably work to maintain a Julia session across runs within this framework, though I haven’t tried it myself.
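Following the DaemonMode.jl README, the wiring would roughly be (untested in combination with dvc):

```
# in one terminal: start a persistent Julia daemon that keeps packages loaded
julia --startup-file=no -e 'using DaemonMode; serve()'

# in each pipeline stage: run the script through the daemon instead of a fresh julia
julia --startup-file=no -e 'using DaemonMode; runargs()' data_processing_script.jl
```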

When I say “DAG”, I’m referring to the computational pipeline that the user specifies. A common pattern I follow is something like:

[data/initial-data] -> (data_processing_script.jl) -> [data/processed-data] -> (data_plotting_script.jl) -> [data/plots]

Here [] denotes a folder containing files (input/output) and () denotes a computational step. In general a script may take input from multiple stages, but the pipeline always satisfies the properties of a directed acyclic graph (DAG). In this case we just have a two-stage computation. Now, if I change the code in data_plotting_script.jl, dvc will recognize that the code is different and re-run only that stage. In contrast, if I change data_processing_script.jl or the contents of the data/initial-data folder, then it will re-run the chain downstream of the change.
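Concretely, that two-stage pipeline would be declared in a dvc.yaml along these lines (a sketch matching the diagram above):

```yaml
stages:
  process:
    cmd: julia data_processing_script.jl
    deps:
      - data_processing_script.jl
      - data/initial-data
    outs:
      - data/processed-data
  plot:
    cmd: julia data_plotting_script.jl
    deps:
      - data_plotting_script.jl
      - data/processed-data
    outs:
      - data/plots
```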

There is some overhead to this process! I need to specify the stages, make sure the scripts run, pay the Julia start-up cost, and think about the staging and logical boundaries of the computation. That’s why (as I mentioned above) I usually prototype in a Pluto notebook or through a test-driven development process in my IDE. But if you just need it to store your data with your git commit, I find it works for that too.


How do you deal with config files?

  • TOML, YAML, or Julia scripts?
  • Several nested files, or a single one with const hyperparameters?