As I understand it, DrWatson is a set of tools for setting up a project and easily doing some common operations in a reproducible way, but it's not a workflow system. In snakemake, nextflow, etc., or the Dagger example I posted, one of the goals is to build a graph that captures the dependencies between the input files and the results you want to compute, and to execute that workflow automatically in parallel, with the ability to resume, or to update only the parts that need to be updated.
e.g. if you have a workflow that starts from data A, computes B from it, and then makes a plot C, you can represent it like this:
A → B → C
In Julia you could write a script like this:
include("make_A.jl")
include("make_B.jl")
include("make_C.jl")
Now let's say you want to modify B, but you don't want to recompute A (it takes two days), so you modify your script:
#include("make_A.jl")
include("make_B.jl")
include("make_C.jl")
Two weeks later, data A has been updated, so you rerun your script, but you forgot that you commented out the first include, and you silently don't get the results you intended.
Of course that's a simple example, but I'm sure many people have experienced these kinds of problems in practice. DataToolkit seems to be doing some of that, but it seems more tailored to managing/ingesting the raw data than to being a full workflow system with parallel execution, monitoring, etc.
The issue with snakemake or nextflow is that they add quite a bit of overhead: they have their own way of doing things that you have to learn and adapt to, which is very useful in some contexts, but for "non-industrial" data science I would prefer something more lightweight and flexible, with the minimum amount of boilerplate possible.
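Just to illustrate the kind of lightweight approach I have in mind, here is a minimal sketch in plain Julia that reruns a step only when its output is missing or older than one of its inputs. The run_if_stale helper and the file names are made up for the example, not an existing package:

# Hypothetical sketch: rerun a step only when its output is missing
# or older than one of its declared inputs. File names are illustrative.
function run_if_stale(step::AbstractString, output::AbstractString; inputs=String[])
    stale = !isfile(output) ||
            any(mtime(inp) > mtime(output) for inp in inputs)
    if stale
        @info "Running $step ($output is missing or stale)"
        include(step)           # runs the step at top level, as in the script above
    else
        @info "Skipping $step ($output is up to date)"
    end
end

# The A → B → C pipeline, with the dependencies stated explicitly:
run_if_stale("make_A.jl", "data/A.csv")                          # rerun only if A.csv is missing (add raw sources as inputs if any)
run_if_stale("make_B.jl", "data/B.csv"; inputs=["data/A.csv"])   # B depends on A
run_if_stale("make_C.jl", "plots/C.png"; inputs=["data/B.csv"])  # C depends on B

That is roughly what snakemake gives you for free, minus the parallelism, monitoring and resuming; the point is just that the dependencies between the steps are stated explicitly instead of living in my memory as commented-out includes.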