[RFC] Mr Phelps - a distributed workflow orchestrator

I would recommend taking a look at Dagger and trying to build some code around it. It provides the basics for letting you schedule work on arbitrary nodes in a cluster, which is vital when you’re dealing with lots of small, potentially independent steps in a workflow, and should be pretty easy to replace later if you find you want to use another solution. I’d be happy to help you with any problems you run into using Dagger for this; just PM/ping me.
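For a concrete feel, here's a minimal sketch of a two-step workflow using Dagger's lazy `delayed` API; hedged, since the exact API surface may differ between Dagger versions:

```julia
using Distributed
addprocs(2)                 # two local workers; these could be cluster nodes
@everywhere using Dagger

# Each `delayed` call builds a lazy node in the task graph; nothing runs yet.
a = delayed(+)(1, 2)        # step 1
b = delayed(*)(a, 3)        # step 2 depends on step 1's output

collect(b)                  # the scheduler walks the DAG across workers => 9
```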

Also, instead of baking in a full web UI, why not first define a REST API that can be used to monitor and even control the workflow graph? That way you don’t lock yourself into having to maintain a super beautiful, featureful and ergonomic UI (which many people will dislike anyway because the blue header is just too blue for them, and there’s either too much or too little JavaScript used by it), and you can instead let users make their own to suit their preferences. You can still ship a simple, optional UI inside or outside the package to get people going, of course.
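To make that concrete, here's a minimal sketch of a read-only status endpoint using HTTP.jl; the `/jobs` route and the in-memory `JOBS` table are invented for illustration, not part of any existing package:

```julia
using HTTP, JSON

# Stand-in for whatever state the orchestrator actually keeps.
const JOBS = Dict("align" => "running", "count" => "queued")

function handle(req::HTTP.Request)
    if req.method == "GET" && req.target == "/jobs"
        return HTTP.Response(200, JSON.json(JOBS))   # job name => status
    end
    return HTTP.Response(404, "not found")
end

HTTP.serve(handle, "127.0.0.1", 8080)
```

A UI (ours or a user's own) then just consumes `GET /jobs` and friends.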

7 Likes

There is also Luigi in Python: GitHub - spotify/luigi: Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

2 Likes

Alrighty, we’ll leverage Dagger. After looking at it a bit it seems a lot more mature, and people way smarter than me made it and maintain it. Admittedly, the latest pushes seem recent, but a lot of the code base seems old. Is this project active/stable? I see no documentation. I’ll have to grok the code again to see how it’s all hooked up now. And I will definitely take you up on your offer, because this could be a ready-made solution to a lot of the heavy lifting.

REST API is a sound idea. I was debating whether I wanted to go the REST route or something else. REST is probably the easiest and least custom thing to ask a user to set up, so we should probably go that route.

Again, the UI isn’t intended to be the star of the show or anything; we can make it really modular and simple, as a hook into the lower-lying API :).

I’m still trying to plan how to break all this functionality out. I’m thinking one master package and maybe 3 or 4 subpackages for now.

1 Like

There are a lot of tools in this domain. Yet every time I or a team tried to settle on one, there was always a “yeah - but ___ doesn’t ___”. Let me make a post to ask:

What do people like about Luigi, Metaflow, Airflow, Snakemake, etc.?

What don’t you like?

What’s missing in this domain?

We need to address this up front so we can design for it.

From what I hear, everyone who uses CWL hates it.
But my sample size isn’t huge.
I have some exposure to Galaxy, which is a CWL-based workflow GUI thing used in Bioinformatics and Speech Processing (turns out those have some interesting tooling overlap, esp. w.r.t. the history of using bash to glue seriously complicated stuff together).


You might like to take a look at DataDepsPaths.jl

I have been told it’s kind of like snakemake.

DataDepsPaths doesn’t currently work, and I probably won’t have time to work on it any time soon.
But it sketches the kind of design intended for DataDeps v2, which unifies the action of fetching (downloading) with post-fetch processing (e.g. unpacking) into a single action of resolving, which boils down to “run arbitrary code to create this file”; plus the idea that whenever one tries to access a file (a data dep path), it gets resolved if it doesn’t exist (which could in turn trigger accessing another file…).
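A loose sketch of that resolve-on-access idea; `RECIPES` and `resolve` are invented names for illustration, not the actual DataDepsPaths API:

```julia
# path => arbitrary code that knows how to create that file
const RECIPES = Dict{String,Function}()

function resolve(path::AbstractString)
    isfile(path) && return path                # already materialized
    haskey(RECIPES, path) || error("no recipe for $path")
    RECIPES[path](path)                        # run arbitrary code to create the file
    return path                                # (which may itself resolve other paths)
end

RECIPES["data.csv"] = p -> download("https://example.com/data.csv", p)
readlines(resolve("data.csv"))                 # first access triggers the fetch
```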

It’s definitely not like what you are after, but I think it has some interesting ideas.
In particular, because its DAG of resolution is implicit in arbitrary code, it may be too hard to parallelize.

2 Likes

The user documentation for Dagger is its Readme :slight_smile:

This is a more complete example of use than the Readme: First steps with Dagger

Shashi did a video at JuliaCon2016 too: JuliaCon 2016 | Dagger.jl - A Framework and Scheduler for Parallel Computing | Shashi Gowda - YouTube

3 Likes

It’s true that Dagger is pretty light on commit activity right now, but it’s not dead by any means. I have plans to fix some scheduling inefficiencies (which any scheduler will run into, due to how Distributed transfers data) when I find the time and energy, but other than that the codebase is really quite solid, if a bit complicated to read. Regardless, the maintainers are active in Julia right now so you will have people to help you if you run into issues with the library.

3 Likes

I think this could be a version 2 thing. A complete tool without CWL integration is definitely possible - but I think it would be a nice expansion feature. I don’t think it’s necessary to worry about it from the beginning, it can be bolted on later.

I wouldn’t say a lot, but a fair amount, and I have access to a number of clusters that use it that I could use for testing. None of them can be plugged into CI, unfortunately, but I’ve been meaning to look for a solution there anyway. If we could provide guidance for including CI for workflows, I feel like that would be a huge advantage as well.

Snakemake’s approach to SLURM and other cluster managers is to make the user do most of the work - you can define default parameters (in terms of memory, cores, etc.) and rule-specific parameters using a config file, and then you have to provide the sbatch command directly, e.g.

$ snakemake -s my_workflow.snakefile --configfile config.yaml --cluster-config cluster.yaml \
    --cluster "sbatch -n {cluster.processors} -N {cluster.nodes} -t {cluster.time} --mem {cluster.memory} -o output/logs/{rule}-%j.out -e output/logs/{rule}-%j.err -p my_partition"

This is a bit annoying as a user, but easier for the developer I think - again, we could start this way and bolt some convenience functions on after (a rough sketch of one such helper follows).
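For instance, a convenience helper might just interpolate per-rule cluster settings into the sbatch string, much like snakemake's `--cluster` template above; the defaults and key names here are made up for illustration:

```julia
# Build the sbatch invocation for one rule from default + per-rule settings.
function sbatch_cmd(rule::String; overrides = Dict{String,Any}())
    opts = merge(Dict("processors" => 1, "nodes" => 1,
                      "time" => "01:00:00", "memory" => "4G"),
                 overrides)
    "sbatch -n $(opts["processors"]) -N $(opts["nodes"]) " *
    "-t $(opts["time"]) --mem $(opts["memory"]) " *
    "-o output/logs/$rule-%j.out -e output/logs/$rule-%j.err"
end

sbatch_cmd("align"; overrides = Dict("memory" => "32G"))
```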

I totally get it, and it’s clever, my primary objection here is the default male-bias of things like this. “Oh look, all of the productivity and data science tools are named after male characters, I guess that field is for men.” I work at a women’s college and have done some work on the gender gap in my field - it’s easy to overlook stuff like this, and individually it may not be a huge deal (I doubt anyone explicitly has that thought above), but the cumulative effect can be quite detrimental.

I think that it’s not that it can’t be done, it’s that some things would be a pain to do. That said, I’m guessing that these pain points are things that might be worth surfacing as issues in the core language, so having the intention of doing everything in julia and then raising issues when things are hard would be worthwhile.

I believe this is how snakemake does it, and that’s mostly fine, but their handling of things like temporary files and ability to manually override isn’t great. Not sure if it’s because they haven’t bothered or because it’s hard, but for me that’s an important thing to get right.

I forgot to mention, more important to me than a GUI is self-documenting runs and good report generation.

The one my previous lab developed is an acronym for “Another Automated Data Analysis Management Application.”

5 Likes

Yeah, this is why I know about it. It may be that everyone that develops CWL hates it, but in bioinformatics it has a lot of uptake and the Galaxy integration makes things simple for users. My (admittedly limited) experience with it is that it’s really simple to do simple things, but it’s really hard to do complicated things.

1 Like

I think doing this (detecting when a target is out of date) perfectly in Julia is likely to be extremely difficult to do efficiently. I did write some stuff to start on it, and I’m up for exploring it further, but I think it’s also okay to take some shortcuts in the case that the git repo is dirty.

When the git repo is clean, we can be more authoritative that we’re running the exact code on disk.
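A rough sketch of that shortcut, using plain `git status` and file mtimes (the helper names are invented):

```julia
# Clean working tree => the recorded commit really is the code we're running.
git_clean() = isempty(readchomp(`git status --porcelain`))

# A target is stale if it's missing or older than any of its inputs.
function stale(target, inputs)
    isfile(target) || return true
    any(mtime(i) > mtime(target) for i in inputs)
end

if !git_clean()
    @warn "repo is dirty; recorded code version may not match the code on disk"
end
```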

1 Like

There are many many tools out there with related functionality, here’s a reasonably up-to-date list.

I think it’s useful to try to understand why this space is so fragmented: it touches on many distinct concerns that are hard to separate cleanly. Here are a few:

  1. tracking provenance of data/versions/models/outputs
    • DVC does this in a language-agnostic way and without getting intertwined with the other concerns below; pachyderm does as well except that it is opinionated about requiring kubernetes.
  2. reproducibility, sometimes across different compute setups
    • containers usually help here; otherwise, even with a given julia dependency Manifest, your computation might work differently depending on what exact system libraries you have underneath, whether you have CUDA, etc.
  3. communicating what the pipeline definitions are
    • most people like to stay within the comfort of their favorite programming language; this is a big source of fragmentation
    • once you scratch the surface, the semantics of setting up jobs that depend on the outputs of other jobs can get hairy; take a look at how even a minimalist KISS framework that only concerns itself with defining tasks, like doit, feels the need to make the depends-on / is-up-to-date relation arbitrarily extensible (see the sketch after this list).
  4. error handling and logging
  5. scaling and the concerns that come with it, like fault tolerance, resource scheduling, sometimes on-demand elastic scaling
  6. support for multiple users with some kind of permissions mechanism
  7. monitoring what jobs have been scheduled, their status
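To illustrate the doit-style extensibility from point 3: the up-to-date check can simply be an arbitrary user-supplied function. The `Rule` struct below is an invented example, not any existing package's API:

```julia
struct Rule
    name::String
    action::Function      # how to (re)build the target
    uptodate::Function    # () -> Bool; arbitrary user-supplied predicate
end

function run!(r::Rule)
    if r.uptodate()
        @info "$(r.name): up to date"
    else
        r.action()
    end
end

# An mtime-based check here, but it could be a checksum, a DB query, anything.
run!(Rule("report",
          () -> println("rebuilding report..."),
          () -> isfile("report.html") && mtime("report.html") > mtime("data.csv")))
```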

It’s not hard to see how all these are intertwined. For example, error handling and logging has a bearing on tracking provenance, on scaling, and on communicating pipeline definitions. And in subtle ways: you may want to treat different kinds of errors differently depending on whether they are failures due to memory shortage (you may want to re-run with more memory), a bug (don’t re-run, debug), or data being unexpectedly weird (you want to look at the data that caused trouble). This ties the programming language used to the scheduling and tracking of jobs, which often runs outside of the language; frankly obnoxious.

Usually the most important consideration when trying to navigate this mess is who it’s for. Who is going to use it, what they are comfortable with or willing to adapt to, in what ways the pipeline setup is a facilitator for communications between different kinds of people who have different backgrounds, etc.

Most of the time, the better choice is not to subscribe to a single framework, but instead to cobble together something for your particular use-case. For that, tools that do one well-delimited thing well tend to be most useful.

I feel like in the julian spirit of having small packages that address well-delimited functionality, a good way forward is to identify small chunks of functionality that would be useful for cobbling together custom pipeline setups. Not that I know what those chunks of functionality should be :wink: . Here’s a few random partially-baked ideas:

  • use a macro to mark certain task / function arguments as having sizes that could become arbitrarily large, and do useful things with that like:
    • fit a regression and extrapolate how long, and how much memory, calling the function with args of particular sizes is likely to take (see the sketch after this list)
    • make the predicted amount of required memory available to the task scheduler (so that your jobs are set up with enough memory automatically) and to the user, and use it to drive progress meters.
    • set up a hook for breaking down a function call with large sized args into several calls with smaller chunks; leave it up to the scheduler whether it should break args down.
  • use macros to mark alternative implementations of a function as being equivalent; for example, one implementation could run on GPU, the other on CPU.
    • combine this with the runtime and memory predictor idea and this could turn into a less stubby AutoOffload, driven by predicted runtime.
  • push Dagger forward, it’s currently the one scheduler we’ve got.
  • tools to facilitate running tasks from within containers
  • for those who buy into kubernetes, write something on top of Kuber.jl to make running on top of k8s look more like ClusterManagers, though JuliaRun already does this.
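As a taste of the runtime-prediction idea in the first bullet: time a function at a few input sizes, fit a log-log regression, extrapolate. A partially-baked sketch, all names illustrative:

```julia
using LinearAlgebra

# Time `f` on probe inputs of the given sizes, fit log(t) ≈ a + b*log(n),
# and return a predictor for larger n. Assumes roughly polynomial scaling.
function runtime_predictor(f, sizes; probe = n -> rand(n))
    f(probe(first(sizes)))                        # warm up / compile first
    ts = [(@elapsed f(probe(n))) for n in sizes]
    X = [ones(length(sizes)) log.(sizes)]
    a, b = X \ log.(ts)                           # least-squares fit
    return n -> exp(a + b * log(n))
end

est = runtime_predictor(sum, [10^4, 10^5, 10^6])
est(10^8)    # predicted seconds for a much larger input
```

The same trick with something like `Base.summarysize` or GC stats could drive the memory side.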

I personally think that tracking provenance is best done outside of the language, and the git metaphor used by DVC works well.

So many random ideas, so little time…

9 Likes

Great write-up, thanks!

There’s also dat, which had a lot of buzz in my circles a couple of years ago, but I don’t know how wide the adoption is. I first understood it as git for data, but it seems like it’s more focused on sharing now :woman_shrugging:

I use airflow a lot and it does give me a lot of headaches; the best list I’ve found of the things that annoy me is from this one alternative tool (which I did not try):

1 Like

@oxinabox - I’m very interested in your project. I’ll grok it for inspiration, and who knows, maybe there’s something in there worth borrowing.

@kolia - Nice to hear from you Kolia!

Yeah, I believe you are correct in that the more modular the better. That being said, I’d like it not to become so abstract that people struggle to debug it. So centering on a few common motifs is a good idea, but if people go too far off the beaten trail they’ll get what they expect along the way (a bit of pain, but still saved time).

The idea of computing load-cost and planning accordingly would be fantastic, and ironically, not the hardest thing to do with julia’s inspection tools. For other languages - well, TBD. No problem to make something like this optional though.

Yeah, I think Dagger is a tool of interest here. It sounds like it’s not far from what we all want anyway. Just a little tricky to grok.

Kuber.jl is a nice find - also great to see it stable. Oh geez… Yeah, there’s a lot of diverse needs, and I’m lacking a lot of background for a lot of them. I’m sure in 2 years there’ll be different needs as well. So flexibility is important; whatever the core of this is, it has to be pretty damn generic…

@kevbonham - You’re right we need a female presence in the package-scape. A female character who orchestrates many pieces of things and makes them into something awesome - hmmm. I can think of a few but please suggest your own :).

Hey, any SLURM experience is better than mine :).

Alright, I’m going to have to do some heavy thinking on this, and on how to break it down. I’m getting glimpses of ways to go, but I need to understand some of the available tooling better. @jpsamaroo - could you chat with me a bit at a high level about Dagger?

Kevin already suggested a suitable name :slight_smile:

2 Likes

It can also be challenging to avoid tokenization - that is, we don’t want to name it after a female character merely to have a token female character in the ecosystem. I think one of the best things the julia devs did after naming the language (aside from, you know, actually building it) was to make it part of the community standards that the name should not be gendered or turned into a character.

I’m probably over-emphasizing this; once one starts down this path it can get a bit crazy. And all the development and design discussions are more important than bikeshedding the name. But that’s one of the reasons I like Hapi (that’s what I named the placeholder repo I linked above) - the god was androgynous, so it sort of avoids the whole issue (and it’s on-theme, because they controlled the flow of the Nile - the Nile Delta is like a DAG too :-P).

Another way to avoid the issue entirely is to follow the julia convention of naming packages with clear indications of what they do: AnalysisWorkflows.jl or ReproduciblePipelines.jl or something (though that’s way less fun).

10 Likes

+1 for something like AnalysisWorkflows.jl.

2 Likes

Naming can come naturally later. I’m willing to accept whatever makes people happy, I just don’t want anyone to feel left out.

I made a slack channel (#analysisworkflows) for this, and invited everyone except @fborda and @kolia (I couldn’t find your names in slack). Hopefully this lets us free-flow ideas and designs. Also, the name is not stuck; it’s just descriptive for slack for now. I don’t want people to think that the slack channel is only for this project - go in there, ask questions, do your own thing, all good. I don’t own anything :), just trying to organize. And if you don’t want to slack, that’s fine too; just post here, message, or contribute.

Is it possible to make a GitHub repo that isn’t tied to a single person’s account? If not, let’s draw straws or whatever. I couldn’t care less if it was on my account; I just want to try some ideas out and learn some things.

Make a GitHub “organization” - they’re free for open source projects and one can manage individual repos and permissions. A lot of Julia projects do this - Dagger and ClusterManagers etc are in the JuliaParallel org - and it’s especially useful if we end up wanting a bunch of small packages.

1 Like

Perfect! Okay, I can do that. Can we change the name of the organization later? I don’t want to commit to anything yet, but I did start hacking away trying basic things in ClusterManagers/Distributed.

Edit: GitHub tutorials imply we can rename. I’ll kick it off with AnalysisWorkflows for now.