Suggestions for best practice in scaling analysis

Hi everyone,

Are there good examples out there of using Julia at scale that people can link to?

At the moment I’m helping develop a package for model-based epidemiological inference, EpiAware, which lives in the CDCgov/Rt-without-renewal repository on GitHub. We’re at the stage where we want to collect inference results across a number of different scenarios to answer some interesting questions about effective epi modelling.

Looking at the space of handy workflow packages in Julia, we’ve considered:

  • DrWatson.jl
  • Dagger.jl
  • Pipelines.jl
  • JobSchedulers.jl

Each has its strengths and weaknesses, but we’re missing examples of a full workflow using any one of them (except for DrWatson, but we want a bit more functionality than it provides by itself).

In particular, I’m not clear what the best practice is for arriving at something like {targets}. At the moment we’re using a mixture of DrWatson and Dagger, but we’re missing functionality like being able to easily visualise the implied DAG of our tasks.
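
For concreteness, here’s roughly the shape of our current DrWatson + Dagger mixture (a minimal sketch only: `run_scenario`, the scenario grid, and the project name are hypothetical stand-ins):

```julia
using DrWatson
@quickactivate "EpiAware"   # assumed project name; activates the DrWatson project
using Dagger

# Hypothetical stand-in for a per-scenario inference step
run_scenario(params) = Dict("params" => params, "result" => sum(values(params)))

scenarios = [Dict("R0" => r, "gen_time" => g) for r in (1.5, 2.0), g in (4, 6)]

# One Dagger task per scenario; together these form the implied DAG
tasks = [Dagger.@spawn run_scenario(p) for p in scenarios]

# Collect results and persist them under DrWatson's naming conventions
mkpath(datadir("sims"))
for (p, t) in zip(scenarios, tasks)
    wsave(datadir("sims", savename(p, "jld2")), fetch(t))
end
```

It’s exactly this kind of task graph that we’d like to be able to visualise.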

What are the community’s thoughts here?

I’m also working on this project and just echoing @SamBrand: any insights would be great.

Something that is really key for us is being able to abstract our compute from the pipeline tooling. It looks like the way to do that is Distributed.jl, maybe with ClusterManagers.jl (though sadly our current available cloud compute is Azure Batch, which doesn’t appear to have any support in the Julia ecosystem), but again I haven’t found many projects that link pipeline best practices with non-local compute setups.
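
To illustrate what I mean by abstracting the compute: with Distributed.jl the mapping code stays identical whether the workers are local or cluster-provisioned. A hedged sketch (the Slurm branch assumes ClusterManagers.jl and a Slurm cluster; as noted, there’s no equivalent for Azure Batch, and `run_scenario` here is a made-up stand-in):

```julia
using Distributed

if get(ENV, "ON_CLUSTER", "false") == "true"
    using ClusterManagers
    addprocs_slurm(32)   # provision 32 workers through Slurm
else
    addprocs(4)          # local workers for development
end

# Definitions every worker needs
@everywhere run_scenario(p) = (p, sqrt(p))

# The pipeline-facing call is the same regardless of the backend
results = pmap(run_scenario, 1:100)
```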

(All that being said, I (but not @SamBrand) am very new to Julia, so I’m potentially just not looking in the right places.)

An alternative to @SamBrand’s suggestions that we have considered is using something like https://www.nextflow.io/ to glue our pipeline together and connect it to compute. Any success stories of using it with Julia?
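
If we did go the Nextflow route, I think the Julia side reduces to a plain CLI script that Nextflow processes shell out to, with Nextflow owning the DAG and the compute. A hypothetical sketch of such a script (the file names and config keys are made up):

```julia
#!/usr/bin/env julia
# Intended to be invoked from a Nextflow process, e.g.
#   julia run_scenario.jl config.toml out.csv
using TOML

config_path, out_path = ARGS[1], ARGS[2]
config = TOML.parsefile(config_path)

# Stand-in for the actual inference step
result = [config["R0"] * i for i in 1:10]

# Write one value per line so a downstream process can pick the file up
open(out_path, "w") do io
    foreach(r -> println(io, r), result)
end
```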

This is something I have been looking for too, but sadly I have not found a drop-in {targets} replacement. Right now, for the work I’m doing, a combination of a Makefile (with tmp/* files that I can invalidate manually to re-run the necessary components) and DrWatson’s @produce_or_load() is just about good enough for my needs, but probably not yours. I’m planning on migrating from Make to just (the command runner), but I don’t think that would be much better for your workflow either.
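
In case it’s useful, this is roughly the pattern I mean (a sketch only: the argument order of `produce_or_load` has shifted across DrWatson versions, and the config and simulation body here are made up):

```julia
using DrWatson
@quickactivate

config = Dict("n" => 1000, "seed" => 42)

# Runs the do-block only if no file matching `config` exists under
# datadir("sims"); otherwise it loads the cached result. Deleting the file
# forces a re-run, which is what my manual Makefile invalidation exploits.
data, file = produce_or_load(config, datadir("sims")) do c
    Dict("draws" => randn(c["n"]))   # must return a Dict so it can be saved
end
```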

Regarding @seabbs’ point: I suspect that you have already examined them, but on the off chance you haven’t, it might be worth looking into FLoops.jl / Transducers.jl / Folds.jl and FoldsDagger.jl to see if they fit your needs for abstracting away from specific pipeline tools. I haven’t used FoldsDagger.jl so can’t comment on whether it works (particularly as it hasn’t had much activity recently), but FLoops.jl and Folds.jl generally make it easy to switch out the compute backend if you’re willing to manage parallelism manually (vs Dagger.jl managing nested loops etc.).
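
The executor-swapping pattern looks something like this (a small sketch; `total_of_squares` is just an example reduction):

```julia
using FLoops   # executors like SequentialEx/ThreadedEx come via Transducers

function total_of_squares(xs, ex)
    # The loop body is identical for every backend; only `ex` changes
    @floop ex for x in xs
        @reduce(s += x^2)
    end
    return s
end

xs = 1:10_000
total_of_squares(xs, SequentialEx())    # single-threaded
total_of_squares(xs, ThreadedEx())      # multi-threaded
# total_of_squares(xs, DistributedEx()) # multi-process, given workers are set up
```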

Again, this is likely not exactly what you’re looking for, but it’s possible that this package could provide some of the functionality you want.

It certainly takes a different approach from {targets}, but it does somewhat address the idea of recomputing objects when the underlying data changes (at first glance, only when you ‘fetch’ them, but I could be wrong as I haven’t had a chance to play around with it). @SamBrand