[RFC] Mr Phelps - a distributed workflow orchestrator

So I was talking in chat with @kevbonham and others about tools that are missing in the Julia ecosystem. It became readily apparent that the ability to monitor workflows (distributed or just script stacks) is mostly a Python game right now. Meanwhile we have people scaling over HPC and doing heavy lifting with Julia! So, think Apache Airflow, but more integrated and geared toward Julia.

I’m making an RFC on a project before it begins because I don’t know what everyone would want from a tool like this. I know what I’d want; I’m thinking:

  1. A programmatic DAG handler with emitters & listeners. At the very least this should contain stateful information and be aware of errors (see the rough sketch after this list).
  2. A WebUI to monitor these jobs, kill them, reset them, inspect them.
  3. Should be able to glue/dispatch a variety of languages/tasks through some kind of abstract interface.
  4. Handle version control of the pipeline, and potentially the data pouring through it (if any)?
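
To make item 1 a bit more concrete, here is a very rough sketch of what a stateful, error-aware task node might look like. All of the names here are hypothetical and only for illustration, not a design commitment:

using Base: @enum

@enum TaskState Pending Running Succeeded Failed

# Hypothetical node in the workflow DAG: knows its upstream dependencies,
# its current state, and any error it hit.
mutable struct WorkflowTask
    name::Symbol
    action::Function            # the work itself (could wrap a shell command, etc.)
    upstream::Vector{Symbol}    # names of the tasks this one depends on
    state::TaskState
    error::Union{Nothing, Exception}
end

WorkflowTask(name, action; upstream = Symbol[]) =
    WorkflowTask(name, action, upstream, Pending, nothing)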

What I’m thinking as far as dependencies…
DrWatson.jl, heavy use of Distributed.jl, Interact.jl, LightGraphs.jl, MetaGraphs.jl, LibGit2.jl?

What I would like to hear about:
What tools are you all finding useful in your Julia workflows?
What tools do you wish you could incorporate into your Julia workflows?
Wanna pitch in once the ball gets rolling on this?
Do you foresee any serious issues here, or should I get the ball rolling now?

I think superficially the core functionality of this is somewhat trivial; it’s just abstracting over the monumental efforts the community has already made to make life easier. Maybe I’m overlooking something though :slight_smile:

20 Likes

Thanks for starting the thread! There’s also MakeItSo.jl, which has some functionality that could be relevant. I made a placeholder repo (I like naming things after obscure gods… sue me), but there’s no actual work there.

In addition to Apache AirFlow mentioned above, prior art includes:

I think interaction with cluster tools like SLURM is also critical for me - maybe ClusterManagers.jl can help?
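
For reference, my understanding is that ClusterManagers.jl makes a SLURM allocation look like ordinary Distributed workers, roughly like this (the partition and time values are just placeholders):

using Distributed, ClusterManagers

# Request 16 workers through SLURM; keyword arguments are passed through
# as srun/sbatch flags.
addprocs(SlurmManager(16), partition = "general", t = "01:00:00")

@everywhere println("hello from worker $(myid())")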

I’ve primarily been using snakemake recently, and it has the advantage of being able to run Python code as well as arbitrary shell commands. The interface is a bit kludgey, and this is a place I think Julia could really shine. Major limitations that I’ve run into with snakemake are:

  • no good way to handle an unknown number of outputs / outputs that may or may not be created
  • no good way to manually re-run individual subtasks, either on all inputs or a subset of inputs
3 Likes

I did some work on writing something for caching the results of expensive calculations, which is a bit related. (I called it Jake.jl as a play on Make and Drake and because I have a friend called Jake ;))

Worth noting that Dagger.jl already exists and that there has been some work on dispatching to the common HPC scheduling software (though I’ve never got that to work properly).

I think anyone designing something new in this space should take a look at the R package “Drake”.

Personally, I don’t care about a WebUI, dispatching non-Julia processes, or version control of code (versioning/provenance of cached results is important, though).

2 Likes

Oh, and the ability to parse and emit Common Workflow Language is not essential, but it would be a huge boon to adoption, I think.

1 Like

@ColinCaine - Mr (Jim) Phelps is a character from the Mission: Impossible TV series of the ’60s-’70s. He led a clandestine crew of disruptive personalities to stop evildoers. Each member of the team had their own special skillset. His main role was organizing the crew and coming up with really crafty, nonlinear plans.

I think a UI is important - not necessary, but it’s something people often like. That being said, I imagine a UI for this tool would take a few afternoons to make. A UI could also be used to construct the workflows; some workflows are not well described via linear code, and a visualization can go a long way. Some pipelines are A->B or A->(B,C,D,E)->F, others not so much!

I’ve also cached calculations and large slabs of intermediate data. I think that’s a really common pattern people in data science inevitably reinvent over and over. Definitely a handy thing, but we’d have to make sure it’s user-controlled at some level - not everyone wants to make a copy!

I think that in order to properly establish provenance one should have version-controlled code? I mean, you could say “ah, the user called script X”, but if employee #111 changes script X and everything stops working… See what I’m saying? I don’t think it’s too hard to call Git and store the state of a pipeline.

I think supporting at the very least Bash scripts would be indispensable.

Dagger is interesting. Last I used it, or tried to use it, it was highly experimental and failed on pretty simple tasks. I bet it’s more mature now, but maybe there’s a way to integrate Dagger here as well. What I’m proposing isn’t a substitute, despite sounding superficially similar…

@kevbonham - I think the easiest way to do this would be to convert the YAML to JSON, and also allow for JSON directly. From what I’ve seen, most of the operations appear trivial - nothing too nested or crazy. But there’s definitely a learning curve if we go that route. My gut instinct is we should, but parsing quickly leads to nightmares.
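
Something like this is what I have in mind; YAML.jl and JSON3.jl are just one possible pairing, and the file name is made up:

using YAML, JSON3

# CWL documents are YAML, so load into plain Dicts/Vectors
# and round-trip to JSON where that's more convenient.
cwl = YAML.load_file("workflow.cwl")
json_str = JSON3.write(cwl)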

ClusterManagers.jl will 100% be needed. SLURM is a PITA - do you have a lot of experience there? I’ve only done some surface-level SLURM things and don’t have a means to test anything without standing something up at home.

So far these are all very good ideas :).

Also, the name doesn’t have to be Mr Phelps, but I thought it was a fun one, kind of in line with Poirot and Dr. Watson.

2 Likes

Yep, I appreciate that others have different priorities :slight_smile:. Just describing what I would want (or not want) from such a tool.

I understand the linking of version control to provenance, and I think that’s the right thing to do for final results, but often if I’m working on something I won’t want to make a load of commits to capture some experimental changes to the work just so that my workflow management thing can run and correctly cache the results.

That’s kind of where I got up to with my project. I was trying to find a way to fingerprint what functions ran for each tracked function so that I only needed to re-execute if any of the functions changed. But it’s pretty difficult to make that fingerprint with dynamic multiple dispatch without running the code (which you don’t want to do because it is expensive).

I suspect there’s some kind of good-enough shortcut here (e.g. require saving the code in a file and just check the timestamp on the file + dependencies; or include all the methods in the fingerprint if you’re not sure which method for foo() is used).
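
For the “include all the methods” shortcut, I imagine something roughly like this - just a sketch, the function name is made up, and it simply skips methods without a source file (e.g. ones defined at the REPL):

using SHA

# Fingerprint a function by hashing the source file of every method,
# since without running the code we can't know which method dispatch will pick.
function fingerprint(f)
    ctx = SHA.SHA256_CTX()
    for m in methods(f)
        file = string(m.file)
        isfile(file) || continue
        SHA.update!(ctx, read(file))
    end
    return bytes2hex(SHA.digest!(ctx))
end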

Re: bash scripts, my meaning is that Julia can already fork external processes easily, so I don’t really see the need to provide functionality that a Julia lambda can handle (buildstep() = run(`build.sh`)). But maybe I’m being unimaginative and making forking a process a first-class citizen is useful.

Re: name; I’m easy :smiley:

2 Likes

Yeah, now I see where you’re coming from… I’m going to let the idea of version-controlling the pipeline stew, and try to come up with a general plan for an architecture (completely open to change) so lots of people can jump in and not step on toes or anything. Could we hash files? Throw in some metadata checks just in case? Maybe that’s too slow…
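
Something like this is what I’m picturing for “hash files, with a metadata check as a fast path” - only a sketch, names made up:

using SHA

# Record mtime + size + content hash; re-hash only when mtime or size changed.
struct FileStamp
    mtime::Float64
    size::Int64
    sha::String
end

function stamp(path)
    st = stat(path)
    return FileStamp(st.mtime, st.size, bytes2hex(open(sha256, path)))
end

function changed(old::FileStamp, path)
    st = stat(path)
    st.mtime == old.mtime && st.size == old.size && return false
    return bytes2hex(open(sha256, path)) != old.sha
end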

Yeah, there is good Julia-Bash interop, but IIRC there are some things that can’t be done via Julia’s shell interop due to the environment (something that happens in SLURM/HPC from time to time). Maybe that’s changed, or there’s a way around the troubles I had a year ago.

You make a good point though, focus on Julia first. Expand later as needed/requested.

Have you seen https://metaflow.org/ ?

1 Like

Nice find! Yeah, that’s similar to what I want to do, but it’s less flexible (I think?). Ironically, if you look at the tutorial code, it reminds me of something that’d be way less of a pain to implement in Julia:

https://github.com/Netflix/metaflow/blob/master/metaflow/tutorials/04-playlist-plus/playlist.py

Good source of inspiration!

I would recommend taking a look at Dagger and trying to build some code around it. It provides the basics for letting you schedule work on arbitrary nodes in a cluster, which is vital when you’re dealing with lots of small, potentially independent steps in a workflow, and should be pretty easy to replace later if you find you want to use another solution. I’d be happy to help you with any problems you run into using Dagger for this; just PM/ping me.
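
For a flavour of what that looks like, here is a minimal sketch in the `Dagger.@spawn` style (depending on your Dagger version, `delayed` may be the entry point instead):

using Dagger

# Each step becomes a lazy task; Dagger infers the DAG from the data
# dependencies and schedules work across whatever Distributed workers exist.
a = Dagger.@spawn rand(100)
b = Dagger.@spawn sum(a)     # depends on `a`, runs once `a` has finished
c = Dagger.@spawn b / 100
fetch(c)                     # wait for the whole chain and get the result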

Also, instead of baking in a full web UI, why not first define a REST API that can be used to monitor and even control the workflow graph? That way you don’t lock yourself into having to maintain a super beautiful, featureful and ergonomic UI (which many people will dislike anyway because the blue header is just too blue for them, and there’s either too much or too little Javascript used by it), and you can instead let users make their own to suit their preferences? You can still ship a simple, optional UI inside or outside the package to get people going, of course.
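
A minimal sketch of that idea with HTTP.jl - the `list_jobs`/`kill_job!` calls are hypothetical orchestrator functions, and the route layout is just an example:

using HTTP, JSON3

const router = HTTP.Router()

# Read-only monitoring endpoint: dump the state of every job as JSON.
HTTP.register!(router, "GET", "/jobs",
    req -> HTTP.Response(200, JSON3.write(list_jobs())))

# Control endpoint: kill a job by id.
HTTP.register!(router, "POST", "/jobs/{id}/kill",
    req -> (kill_job!(HTTP.getparams(req)["id"]); HTTP.Response(204)))

HTTP.serve(router, "0.0.0.0", 8080)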

7 Likes

There is also Luigi in Python: GitHub - spotify/luigi: Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

2 Likes

Alrighty, we’ll leverage Dagger. After looking at it a bit, it seems a lot more mature, and people way smarter than me made it and maintain it. Admittedly, the latest pushes seem recent, but a lot of the codebase seems old. Is the project active/stable? I see no documentation. I’ll have to grok the code again to see how it’s all hooked up now. And I will definitely take you up on your offer, because this could be a ready-made solution to a lot of the heavy lifting.

A REST API is a sound idea. I was debating whether I wanted to go the REST route or something else. REST is probably the easiest and least custom thing to ask a user to set up, so we should probably go that route.

Again, the UI isn’t intended to be the star of the show or anything; we can make it really modular and simple, as a hook into the lower-level API :).

I’m still trying to plan how to break all this functionality out. I’m thinking one master package and maybe something like 3 or 4 subpackages for now.

1 Like

There are a lot of tools in this domain. Yet every time I or a team I was on tried to settle on one, there was always a “yeah - but ___ doesn’t ___”. Let me make a post to ask:

What do people like about Luigi, MetaFlow, AirFlow, SnakeMake, etc?

What don’t you like?

What’s missing in this domain?

We need to address this up front so we can design for it.

From what I hear, everyone who uses CWL hates it.
But my sample size isn’t huge.
I have some exposure to Galaxy, which is a CWL-based workflow GUI thing used in Bioinformatics and Speech Processing (turns out those have some interesting tooling overlap, especially w.r.t. a history of using bash to glue seriously complicated stuff together).


You might like to take a look at DataDepsPaths.jl

I have been told it’s kind of like snakemake.

DataDepsPaths doesn’t currently work, and I probably won’t have time to work on it any time soon.
But it is the kind of design intended for DataDeps v2, which unifies the action of fetching (downloading) with post-fetch processing (e.g. unpacking) into a single action of resolving, which boils down to “run arbitrary code to create this file”, plus the idea that whenever one tries to access a file (a data dep path), it tries to resolve it if it doesn’t exist (which could in turn trigger accessing another file…)
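
The core of that resolution idea is tiny - something like this (names made up):

# Resolve-on-access: if the file isn't there, run arbitrary code to create it.
function resolve(path::AbstractString, recipe::Function)
    isfile(path) || recipe(path)    # the recipe itself may resolve other paths
    return path
end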

It’s definitely not quite what you are after, but I think it has some interesting ideas.
One caveat: because its DAG of resolution is implicit in arbitrary code, it may be too hard to parallelize.

2 Likes

The user documentation for Dagger is its Readme :slight_smile:

This is a more complete example of use than the Readme: First steps with Dagger

Shashi did a video at JuliaCon2016 too: JuliaCon 2016 | Dagger.jl - A Framework and Scheduler for Parallel Computing | Shashi Gowda - YouTube

3 Likes

It’s true that Dagger is pretty light on commit activity right now, but it’s not dead by any means. I have plans to fix some scheduling inefficiencies (which any scheduler will run into, due to how Distributed transfers data) when I find the time and energy, but other than that the codebase is really quite solid, if a bit complicated to read. Regardless, the maintainers are active in Julia right now so you will have people to help you if you run into issues with the library.

3 Likes

I think this could be a version 2 thing. A complete tool without CWL integration is definitely possible - but I think it would be a nice expansion feature. I don’t think it’s necessary to worry about it from the beginning, it can be bolted on later.

I wouldn’t say a lot, but a fair amount, and I have access to a number of clusters that use it that I could use for testing. None of them can be plugged into CI, unfortunately, but I’ve been meaning to look for a solution there anyway. If we could provide guidance for including CI for workflows, I feel like that would be a huge advantage as well.

Snakemake’s approach to SLURM and other cluster managers is to make the user do most of the work - you can define default parameters (in terms of memory, cores, etc.) and rule-specific parameters using a config file, and then you have to provide the sbatch command directly, e.g.:

$ snakemake -s my_workflow.snakefile  --configfile config.yaml --cluster-config cluster.yaml     \
    --cluster "sbatch -n {cluster.processors} -N {cluster.nodes} -t {cluster.time} --mem {cluster.memory} -o output/logs/{rule}-%j.out -e output/logs/{rule}-%j.err -p my_partition"

This is a bit annoying as a user, but easier for the developer I think - again, we could start this way and bolt some convenience functions on after.

I totally get it, and it’s clever, my primary objection here is the default male-bias of things like this. “Oh look, all of the productivity and data science tools are named after male characters, I guess that field is for men.” I work at a women’s college and have done some work on the gender gap in my field - it’s easy to overlook stuff like this, and individually it may not be a huge deal (I doubt anyone explicitly has that thought above), but the cumulative effect can be quite detrimental.

I think that it’s not that it can’t be done, it’s that some things would be a pain to do. That said, I’m guessing that these pain points are things that might be worth surfacing as issues in the core language, so having the intention of doing everything in julia and then raising issues when things are hard would be worthwhile.

I believe this is how snakemake does it, and that’s mostly fine, but their handling of things like temporary files and ability to manually override isn’t great. Not sure if it’s because they haven’t bothered or because it’s hard, but for me that’s an important thing to get right.

I forgot to mention, more important to me than a GUI is self-documenting runs and good report generation.

The one my previous lab developed is an acronym for “Another Automated Data Analysis Management Application.”

5 Likes

Yeah, this is why I know about it. It may be that everyone that develops CWL hates it, but in bioinformatics it has a lot of uptake and the Galaxy integration makes things simple for users. My (admittedly limited) experience with it is that it’s really simple to do simple things, but it’s really hard to do complicated things.

1 Like

I think doing this (detecting when a target is out of date) perfectly in Julia is likely to be extremely difficult to do efficiently. I did write some stuff to start on it, and I’m up for exploring it further, but I think it’s also okay to take some shortcuts in the case that the git repo is dirty.

When the git repo is clean, we can be more confident that we’re running the exact code on disk.
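
Recording that state is cheap with the LibGit2 stdlib - roughly like this (the function name is made up):

using LibGit2

# Record the commit a cached result was produced at, and whether the working
# tree was dirty (in which case we fall back to content hashing, or just
# mark the result as unverified).
function code_state(repo_path::AbstractString)
    repo = LibGit2.GitRepo(repo_path)
    return (commit = string(LibGit2.head_oid(repo)), dirty = LibGit2.isdirty(repo))
end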

1 Like