A typical data analysis project, for me, has many steps. For example, I might read in some datasets, fit some probabilistic models, create a database of simulations from the posterior, and optimize something over those simulations. Then I might create plots of the raw data, the posterior simulations, and the optimized quantities. Each step in this workflow can be computationally intensive, so I'd like to avoid re-running steps whenever possible.
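For concreteness, here's a minimal sketch of the kind of manual step-caching I end up wanting, using the Serialization stdlib. The `cached` helper, the file paths, and the trivial stand-in steps are all made up for illustration:

```julia
using Serialization

# Run `f` only if no cached result exists at `path`; otherwise load the
# previously serialized result from disk instead of recomputing it.
function cached(f, path::AbstractString)
    isfile(path) && return deserialize(path)
    result = f()
    mkpath(dirname(path))
    serialize(path, result)
    return result
end

# Trivial stand-ins for the expensive steps above (reading data, fitting,
# simulating, optimizing); each is skipped on subsequent runs.
data = cached(() -> randn(1_000), "cache/data.jls")
sims = cached(() -> cumsum(data), "cache/sims.jls")
best = cached(() -> maximum(sims), "cache/optimum.jls")
```

Of course this doesn't track dependencies: if an upstream step changes, the stale caches downstream have to be deleted by hand, which is exactly the problem.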
In previous projects, I've used multi-language scripts with a workflow manager like Snakemake to keep the analysis organized, and that scripting approach works well. However, I understand this isn't the best way to work with Julia (there are hacks to make Julia behave sort of like a scripting language, but that's not what it was designed for).
What is the best practice for keeping track of dependencies in multi-step analysis workflows? The options I'm aware of are:

- use Julia with Snakemake and eat the startup/load time
- put everything in a main.jl file that `include`s a bunch of step files, commenting out the ones that don't need to be re-run (sketched below)
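The second option looks something like this (the file names are hypothetical):

```julia
# main.jl: one driver script per project; each step lives in its own file,
# and steps that have already run get commented out by hand.
include("read_data.jl")    # step 1: read in the datasets
include("fit_models.jl")   # step 2: fit the probabilistic models
# include("simulate.jl")   # step 3: already run, so commented out
# include("optimize.jl")   # step 4: already run, so commented out
include("plots.jl")        # step 5: plots always re-run
```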
A notebook is great for the last step – take all the outputs and make plots. But a notebook doesn't do anything about the dependency graph (if I update something in the middle, I want the changes to propagate through without re-running the previous steps)…
Yeah, I saw the JuliaCon presentation – really cool. But it seems like this would only work within a single session, rather than over the weeks/months I'm working (possibly collaboratively) on a project?
No, as long as there is a function barrier before the actual computation, you can put something in a type-unstable object and then everything will infer just fine when you take it out of the object to do the computation.
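Something like this minimal sketch (the names are made up):

```julia
# A type-unstable container: the compiler only knows the values are `Any`.
results = Dict{String,Any}()
results["sims"] = randn(10_000)

# The function barrier: when the value is passed in, Julia dispatches on its
# concrete runtime type (Vector{Float64} here), so the body is compiled with
# full type information even though the container itself is type-unstable.
heavy_computation(x::AbstractVector) = sum(abs2, x) / length(x)

heavy_computation(results["sims"])  # fast: inference happens at this call
```

The lookup `results["sims"]` itself is a small dynamic dispatch, but all the actual work inside `heavy_computation` runs on fully inferred code.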