A typical data analysis project, for me, has many steps. For example, I might read in some datasets, fit some probabilistic models, create a database of simulations from the posterior, and optimize something over those simulations. Then I might create plots of the raw data, the posterior simulations, and the optimized quantities. Each step in this workflow can be computationally intensive, so I'd like to avoid re-running steps whenever possible.
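For concreteness, here's a minimal sketch of the kind of manual step-caching I end up wanting, using the Serialization stdlib. The `cached` helper, the file paths, and the trivial stand-in steps are all made up for illustration:

```julia
using Serialization

# Run `f` only if no cached result exists at `path`; otherwise load the
# previously serialized result from disk instead of recomputing it.
function cached(f, path::AbstractString)
    isfile(path) && return deserialize(path)
    result = f()
    mkpath(dirname(path))
    serialize(path, result)
    return result
end

# Trivial stand-ins for the expensive steps above (reading data, fitting,
# simulating, optimizing); each is skipped on subsequent runs.
data = cached(() -> randn(1_000), "cache/data.jls")
sims = cached(() -> cumsum(data), "cache/sims.jls")
best = cached(() -> maximum(sims), "cache/optimum.jls")
```

Of course this doesn't track dependencies: if an upstream step changes, the stale caches downstream have to be deleted by hand, which is exactly the problem.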
In previous projects, I've used multi-language scripts with a workflow manager like Snakemake to keep the analysis organized, and that scripting approach works well. However, I understand this isn't the best way to work with Julia (there are hacks to make Julia behave sort of like a scripting language, but that's not what it was designed for).
What is the best practice for keeping track of dependencies in multi-step analysis workflows? The options I'm aware of are:

- use Julia with Snakemake and eat the startup/load time
- put everything in a main.jl file that `include`s a bunch of step files, commenting out the ones that don't need to be re-run (sketched below)
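The second option looks something like this (the file names are hypothetical):

```julia
# main.jl: one driver script per project; each step lives in its own file,
# and steps that have already run get commented out by hand.
include("read_data.jl")    # step 1: read in the datasets
include("fit_models.jl")   # step 2: fit the probabilistic models
# include("simulate.jl")   # step 3: already run, so commented out
# include("optimize.jl")   # step 4: already run, so commented out
include("plots.jl")        # step 5: plots always re-run
```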
A notebook is great for the last step – take all the outputs and make plots. But a notebook doesn't do anything about the dependency graph (if I update something in the middle, I want the changes to propagate through without re-running the previous steps)…
Yeah, I saw the JuliaCon presentation – really cool. But it seems like this would only work within a single session, rather than over the weeks/months I'm working (possibly collaboratively) on a project?
No, as long as there is a function barrier before the actual computation, you can put something in a type-unstable object and then everything will infer just fine when you take it out of the object to do the computation.
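Something like this minimal sketch (the names are made up):

```julia
# A type-unstable container: the compiler only knows the values are `Any`.
results = Dict{String,Any}()
results["sims"] = randn(10_000)

# The function barrier: when the value is passed in, Julia dispatches on its
# concrete runtime type (Vector{Float64} here), so the body is compiled with
# full type information even though the container itself is type-unstable.
heavy_computation(x::AbstractVector) = sum(abs2, x) / length(x)

heavy_computation(results["sims"])  # fast: inference happens at this call
```

The lookup `results["sims"]` itself is a small dynamic dispatch, but all the actual work inside `heavy_computation` runs on fully inferred code.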