Snakemake with Julia (and similarities with DrWatson or…)

I have a colleague who works with Python and Snakemake, and he keeps telling me how nice this tool is for scientific workflow management. I haven’t understood much about it, though, and above all whether what it provides is really needed much more in a Python world than in a Julia one (for example, reproducibility/containerisation).

On the Snakemake website they also mention Julia. Does anyone here use it with Julia? For what reasons? Do you have a public example to share?

Is DrWatson similar in objectives but more tailored to Julia workflows, or is it really something different?

2 Likes

In the DrWatson paper we do discuss alternatives in other languages, but unfortunately Snakemake isn’t there. Maybe it is a recent development? I would love to hear a summary from someone who has used it.

Not so new…

I would love too :slight_smile: :slight_smile: :slight_smile:

I was checking it out just now. The framework focuses on reproducible (and cross-platform?) data analysis pipelines. DrWatson isn’t really about data management/analysis, which may be the reason we didn’t compare them in the paper? I don’t remember; it was many years ago!

I believe @tecosaur’s DataToolkit.jl would be the Julia competitor in this space!

1 Like

Funnily enough, there was a post 2h before this thread about basically the same topic, by @jonathanBieler.

2 Likes

As I understand it, DrWatson is a set of tools for setting up a project and easily doing some common operations in a reproducible way, but it’s not a workflow system. In Snakemake, Nextflow, etc., or the Dagger example I posted, one of the goals is to build a graph that captures the dependencies between the input files and the results you want to compute, and to execute that workflow automatically in parallel, with the ability to resume, or to update only the parts that need to be updated.

E.g. if you have a workflow that starts from data A, computes B from it, and then makes a plot C, you can represent it like this:

A → B → C

In Julia you could make a script like this:

include("make_A.jl")
include("make_B.jl")
include("make_C.jl")

Now let’s say you want to modify B, but you don’t want to recompute A (it takes two days), so you modify your script:

#include("make_A.jl")
include("make_B.jl")
include("make_C.jl")

Two weeks later the data A has been updated, so you rerun your script, but you forgot that you commented out the first include, so you silently don’t get the results you intended.

Of course that’s a simple example, but I’m sure many people have experienced this kind of problem in practice. DataToolkit seems to do some of that, but it looks more tailored to managing/ingesting the raw data than to being a full workflow system with parallel execution, monitoring, etc.

The issue with Snakemake or Nextflow is that they add quite a bit of overhead: they have their own way of doing things that you have to learn and adapt to. That is very useful in some contexts, but for “non-industrial” data science I would prefer something more lightweight and flexible, with the minimum amount of boilerplate possible.
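To sketch what I mean by lightweight: a few lines of plain Julia already give you make-style staleness checks. (run_step is a made-up helper for illustration, not from any package; the file names are made up too, and it assumes each make_X.jl writes its output file.)

# Rerun a step only when its output file is missing or older than any input.
function run_step(script, output, inputs)
    stale = !isfile(output) || any(mtime(inp) > mtime(output) for inp in inputs)
    stale && include(script)
    return output
end

A = run_step("make_A.jl", "A.csv", ["raw.csv"])
B = run_step("make_B.jl", "B.csv", [A])
C = run_step("make_C.jl", "C.png", [B])

With something like that, updating A automatically invalidates B and C, and nothing has to be commented out.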

Thanks for the shoutout! :heart_eyes:

Yup. The way I’d put it is that DataToolkit has some of the pieces of a workflow system, but it isn’t one. There’s a similar story with Dagger.

That said, I designed DataToolkit to be able to do things I didn’t plan for it to be able to do, and I think it’s very much possible for it to gain some of the key missing pieces of functionality.

For example, I’ve been thinking of making a plugin that allows for “parametric data sets”, and I also have a few thoughts on how to make it so it can run across multiple machines at once.

Oh, to give an example of what it can currently do, I might as well show off the MetaGraphsNext extension:

[Image: dependency graph of the project’s datasets]

This shows all the datasets I used for a project, as well as the dependencies between them.
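If you haven’t come across MetaGraphsNext before, the structure behind that picture is just a labelled digraph. A rough sketch of the idea, with made-up dataset names, using plain MetaGraphsNext rather than the extension’s actual API:

using Graphs, MetaGraphsNext

# A labelled digraph of datasets; an edge points from an input to what is derived from it.
g = MetaGraph(DiGraph(); label_type=Symbol, vertex_data_type=String, edge_data_type=Nothing)
g[:raw_reads] = "raw sequencing data"
g[:alignments] = "reads aligned to a reference"
g[:counts] = "per-gene count matrix"
g[:raw_reads, :alignments] = nothing  # :alignments is derived from :raw_reads
g[:alignments, :counts] = nothing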

There’s potential for a plugin that integrates with some parallelisation tool to split up dependencies and execute them in parallel, but nothing is currently developed.
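The core of such a plugin could be quite small. A sketch only, with stand-in names and a hardcoded dependency graph:

# One task per dataset, each waiting on the tasks of its dependencies first.
deps = Dict(:A => Symbol[], :B => [:A], :C => [:B])  # hypothetical DAG
compute(name) = @info "computing $name"              # stand-in for real work

tasks = Dict{Symbol,Task}()
for name in [:A, :B, :C]                             # any topological order works
    dep_tasks = [tasks[d] for d in deps[name]]       # grab these before spawning
    tasks[name] = Threads.@spawn begin
        foreach(wait, dep_tasks)                     # block until all inputs are done
        compute(name)
    end
end
foreach(wait, values(tasks))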

2 Likes

Just for clarity, here’s how DataToolkit currently fares OOTB:

  • build a graph that captures the dependencies between the inputs and the results ✅
  • parallel workflow execution ❌
  • the ability to resume a workflow ✅
  • incremental/minimal updates ✅
  • generic processing steps that can be applied in bulk ❌

If anybody is interested in helping tick a few more boxes, I’d be very happy to collaborate.

4 Likes