Snakemake with julia (and similarities with Dr Watson or..)

sylvaticus · February 20, 2025, 9:36am

I have a collegue that works with Python and Snakemake, telling me how nice is this tool for scientific workflow management, but I haven’t understood much about it, and above all if what it provides is really needed much more in a python world rather than in a Julia one (for example, reproducibility/containerisation).

In the website of Snakemake they cite also Julia, is there anyone that use it with Julia? For which reasons? Do you have a public example to share ?

Is DrWatson similar in objectives but more tailored to Julia workflows, or is really something different ?

Datseris · February 20, 2025, 10:40am

In the DrWatson paper we do discuss alternatives in other languages, but unfortunately snakemake isn’t there. Maybe it is a recent development? Would love to hear a summary from someone that have used it.

sylvaticus · February 20, 2025, 10:43am

Not so new…

I would love too

Datseris · February 20, 2025, 10:59am

I was checking it out now. The framework focuses on reproducible (and cross-platform?) data analysis pipelines. DrWatson isn’t really about data management / analysis which may be the reason we didn’t compare in the paper? I don’t remember it was many years ago!

mrufsvold · February 20, 2025, 11:29am

I believe @tecosaur 's DataToolkit.jl would be the Julia competitor in this space!

abraemer · February 20, 2025, 6:28pm

Funnily enough there was a post 2h before this thread about basically the same topic by @jonathanBieler

jonathanBieler · February 20, 2025, 8:58pm

As I understand DrWatson is a set of tools for setting up and easily do some commons operations in a reproducible way, but it’s not a workflow system. In snakemake, nextflow, etc or the dagger example I posted, one of the goal is to build a graph that capture the dependencies between the inputs files and the results you want to compute, and to execute that workflow automatically in parallel, with the ability to resume, or update only the part that needs to be updated.

e.g. if you have a workflow that starts from data A, compute B from it, and then make a plot C, you can represent it like this :

A → B → C

In Julia you could make a script like this :

include("make_A.jl")
include("make_B.jl")
include("make_C.jl")

Now let’s say you want to modify B but you don’t want to recompute A (it takes two days) so you modify your script :

#include("make_A.jl")
include("make_B.jl")
include("make_C.jl")

Two weeks later the data A have been updated so you rerun your script but you forgot that you commented the first script, so you silently don’t get the results you intended.

Of course that’s a simple example, but I’m sure many people have experienced these kind of problems in practice. DataToolkit seems to be doing some of that, but it seems more tailored to managing/ingesting the raw data that being a full workflow system with parallel execution, monitoring, etc.

The issue with snakemake or nextflow is that they add quite a bit of overhead, they have their own way of doing things you have to learn and adapt to, which are very useful in some contexts, but for “non-industrial” data science I would prefer something more lightweight and flexible with the minimum amount of boilerplate possible.

tecosaur · February 21, 2025, 2:07am

Thanks for the shoutout!

Yup. The way I’d put it is that DataToolkit has some of the pieces of a workflow system, but it isn’t one. There’s a similar story with Dagger.

That said, I designed DataToolkit to be able to do things I didn’t plan for it to be able to do, and I think it’s very much possible for it to gain some of the key missing pieces of functionality.

For example, I’ve been thinking of making a plugin that allows for “parametric data sets”, and I also have a few thoughts on how to make it so it can run across multiple machines at once.

Oh, to give an example of what it can currently do, I might as well show off the MetaGraphsNext extension:

This shows all the datasets I used for a project, as well as the dependencies between them.

There’s potential for a plugin that integrates with some parallelisation tool to split up dependencies and execute them in parallel, but nothing currently developed.

tecosaur · February 21, 2025, 5:24am

Just for clarity, what DataToolkit currently supports OOTB is:

build a graph that capture the dependencies between the inputs and the results
parallel workflow execution
the ability to resume a workflow
incremental/minimal updates
generic processing steps, that can be applied in bulk

If anybody is interested in helping tick a few more boxes, I’d be very happy to collaborate.

moble · February 23, 2025, 8:39pm

I think everyone interested in any of this might also be interested in showyourwork, which uses snakemake, but does a lot more.

@MilesCranmer had a very nice series of posts starting here discussing this, comparing to other tools (including the excellent DrWatson, which I particularly love!), and even providing an example for use with Julia.

Topic		Replies	Views
Workfow for data analysis project: avoid re-running with many inputs, intermediate steps, and models New to Julia workflow	8	897	October 8, 2020
Dagger + Dates = snakemake? General Usage data , speculative , design	3	300	February 25, 2025
[RFC] Mr Phelps - a distributed workflow orchestrator Package Announcements	42	5601	August 4, 2021
[ANN] DataToolkit.jl — Reproducible, flexible, and convenient data management Package Announcements package , announcement , data , reproducibility	19	3231	October 4, 2024
Is there a Julia package similar to the Python's "ShowYourWork"? VS Code	17	2183	May 1, 2023

Snakemake with julia (and similarities with Dr Watson or..)

Related topics