Workflow for a data analysis project: avoiding re-runs with many inputs, intermediate steps, and models

A typical data analysis project, for me, has many steps. For example, I might read in some datasets, fit some probabilistic models, create a database of simulations from the posterior model, and optimize something over that database of simulations. Then I might create plots of the raw data, the posterior simulations, and the optimized quantities. Each step in this workflow can be computationally intensive, so I’d like to avoid re-running steps whenever possible.

In previous projects, I’ve used multi-language scripts with a workflow manager like Snakemake to keep the analysis organized as a straightforward pipeline. That scripting approach works well. However, I understand it isn’t the best way to work with Julia (there are hacks to make Julia sort of behave like a scripting language, but that’s not what it was designed for).

What is the best practice for keeping track of dependencies in multi-step analysis workflows? The options I’m aware of are:

  • use Julia with Snakemake and eat the startup/load time
  • put everything in a main.jl file and include a bunch of files, commenting out the ones that don’t need to be re-run
  • try to cache things

Is there a better way? Many thanks!

use a notebook?

A notebook is great for the last step – taking all the outputs and making plots. But a notebook doesn’t do anything about the dependency graph (if I update something in the middle, I want to propagate changes through without re-running the previous steps)…

https://github.com/fonsp/Pluto.jl tracks cell dependencies and automatically re-runs affected cells

Yeah, I saw the JuliaCon presentation – really cool. But it seems like this would only work within a single session, rather than over the weeks/months while I’m working (possibly collaboratively) on a project?

It looks like your intermediate results need to be written to disk anyway, so maybe use the DrWatson.jl workflow or something.
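For example, DrWatson’s produce_or_load caches a step’s result on disk under a filename derived from its parameters, so re-running a script only recomputes what is missing. A minimal sketch, assuming a recent DrWatson version where produce_or_load takes the function first (the project name, parameters, and fit_model function here are hypothetical):

```julia
using DrWatson
@quickactivate "MyAnalysis"  # hypothetical DrWatson project

# The expensive step: takes a config Dict and returns a Dict of results.
function fit_model(config)
    n = config["n"]
    return Dict("posterior" => randn(n))  # stand-in for the real computation
end

config = Dict("n" => 1000)

# First call runs fit_model and saves the result (e.g. data/sims/n=1000.jld2);
# later calls load the saved file instead of recomputing.
result, file = produce_or_load(fit_model, config, datadir("sims"))
```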


You probably want something akin to DrWatson.jl. It provides a structure for this kind of workflow.

Also, please don’t comment and uncomment lines as needed! Use global Booleans to control this kind of behavior.
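A minimal sketch of that pattern (the stage names and placeholder functions are hypothetical):

```julia
# Stage flags: flip these instead of commenting code in and out.
const RUN_FIT   = true
const RUN_PLOTS = false

# Placeholder stand-ins for the expensive steps.
fit_model(data) = sum(data)
make_plots(fit) = println("plotting ", fit)

data = [1.0, 2.0, 3.0]
fit = RUN_FIT ? fit_model(data) : nothing
RUN_PLOTS && make_plots(fit)
```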

Another trick I like is to keep many datasets in a dictionary, and use @pack and @unpack to work with them inside functions.
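For instance, a small sketch using the UnPack.jl spelling of these macros (where the mutating form is @pack!; they are also available via Parameters.jl):

```julia
using UnPack  # provides @unpack and @pack!

# Keep several datasets together in one Dict with Symbol keys.
d = Dict{Symbol,Any}(:train => rand(100), :test => rand(20))

function summarize!(d)
    @unpack train, test = d    # pulls out d[:train] and d[:test]
    train_mean = sum(train) / length(train)
    @pack! d = train_mean      # stores the result back as d[:train_mean]
    return train_mean
end

summarize!(d)
```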


You probably want something akin to DrWatson.jl. It provides a structure for this kind of workflow.

This looks like what I was looking for! I’ll check it out in more detail.

please don’t comment and uncomment lines as needed! Use global Booleans to control this kind of behavior.

This was meant to be an example of what I don’t want to do! 🙂

Another trick I like is to keep many datasets in a dictionary, and use @pack and @unpack to work with them inside functions.

Are there performance benefits/costs to putting datasets in a dictionary vs. creating a struct for them?

No, as long as there is a function barrier before the actual computation, you can put something in a type-unstable object and then everything will infer just fine when you take it out of the object to do the computation.
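A minimal sketch of the function barrier idea:

```julia
# `store` is type-unstable: the compiler only knows its values are `Any`.
store = Dict{Symbol,Any}(:x => rand(1000), :scale => 2.0)

# Inner kernel: by the time it runs, `x` and `scale` have concrete types,
# so the body compiles to fully inferred, fast code.
kernel(x, scale) = sum(scale .* x)

# Dynamic dispatch happens once, at this barrier call; the work inside
# `kernel` is unaffected by the Dict's type instability.
total = kernel(store[:x], store[:scale])
```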
