[RFC] Mr Phelps - a distributed workflow orchestrator

There are many, many tools out there with related functionality; here’s a reasonably up-to-date list.

I think it’s useful to try to understand why this space is so fragmented: it touches on many distinct concerns that are hard to separate cleanly. Here are a few:

  1. tracking provenance of data/versions/models/outputs
    • DVC does this in a language-agnostic way and without getting intertwined with the other concerns below; Pachyderm does as well, except that it is opinionated about requiring Kubernetes.
  2. reproducibility, sometimes across different compute setups
    • containers usually help here; otherwise, even for a given Julia dependency Manifest, your computation might work differently depending on what exact system libraries you have underneath, whether you have CUDA, etc.
  3. communicating what the pipeline definitions are
    • most people like to stay within the comfort of their favorite programming language; this is a big source of fragmentation
    • once you scratch the surface, the semantics of setting up jobs that depend on the outputs of other jobs can get hairy; take a look at how even a minimalist, KISS framework like doit, which only concerns itself with defining tasks, feels the need to make the depends-on / is-up-to-date relation arbitrarily extensible.
  4. error handling and logging
  5. scaling and the concerns that come with it, like fault tolerance, resource scheduling, sometimes on-demand elastic scaling
  6. support for multiple users with some kind of permissions mechanism
  7. monitoring what jobs have been scheduled, their status

It’s not hard to see how all these are intertwined. For example, error handling and logging have a bearing on tracking provenance, on scaling, and on communicating pipeline definitions. And in subtle ways: you may want to treat different kinds of errors differently depending on whether they are failures due to memory shortage (you may want to re-run with more memory), a bug (don’t re-run, debug), or data being unexpectedly weird (you want to look at the data that caused trouble). This ties the programming language to the scheduling and tracking of jobs, which often run outside of the language; frankly obnoxious.

Usually the most important consideration when trying to navigate this mess is who it’s for: who is going to use it, what they are comfortable with or willing to adapt to, in what ways the pipeline setup facilitates communication between people with different backgrounds, etc.

Most of the time, the better choice is not to subscribe to a single framework, but instead to cobble together something for your particular use-case. For that, tools that do one well-delimited thing well tend to be most useful.

I feel like, in the Julian spirit of having small packages that address well-delimited functionality, a good way forward is to identify small chunks of functionality that would be useful for cobbling together custom pipeline setups. Not that I know what those chunks of functionality should be :wink: . Here are a few random partially-baked ideas:

  • use a macro to mark certain task / function arguments as having sizes that could become arbitrarily large, and do useful things with that like:
    • fit a regression and extrapolate how long, and how much memory, calling the function with args of particular sizes is likely to take (a rough sketch follows after this list)
    • make the predicted memory requirement available to the task scheduler (so that jobs are set up with enough memory automatically) and to the user, and use the predictions to drive progress meters
    • set up a hook for breaking a function call with large-sized args down into several calls with smaller chunks; leave it up to the scheduler whether it should break args down.
  • use macros to mark alternative implementations of a function as being equivalent; for example, one implementation could run on GPU, the other on CPU.
    • combine this with the runtime and memory predictor idea and this could turn into a less stubby AutoOffload, driven by predicted runtime.
  • push Dagger forward, it’s currently the one scheduler we’ve got.
  • tools to facilitate running tasks from within containers
  • for those who buy into kubernetes, write something on top of Kuber.jl to make running on top of k8s look more like ClusterManagers, though JuliaRun already does this.
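
To make the first idea above a bit more concrete, here is a rough sketch of the runtime-extrapolation part in plain Julia. Everything here (`CostModel`, `record!`, `predict_time`) is a made-up name, and a real package would presumably hide this behind the macro; the point is just that a log-log regression over a few timed samples gets you a usable extrapolation:

```julia
# Hypothetical sketch: extrapolate runtime from a few timed samples.
struct CostModel
    sizes::Vector{Float64}
    times::Vector{Float64}
end
CostModel() = CostModel(Float64[], Float64[])

# Time one call of `f(args...)` for an input of size `n` and record the sample.
# (The first call includes compilation time; a real version would warm up first.)
function record!(m::CostModel, n::Integer, f, args...)
    t = @elapsed f(args...)
    push!(m.sizes, n)
    push!(m.times, t)
    return t
end

# Fit log(time) ≈ a + b*log(size) by least squares, then extrapolate to size `n`.
function predict_time(m::CostModel, n::Integer)
    X = [ones(length(m.sizes)) log.(m.sizes)]
    a, b = X \ log.(m.times)
    return exp(a + b * log(n))
end

# Usage: profile at a few small sizes, predict a big run.
work(x) = sum(sin, x)
m = CostModel()
for n in (10_000, 100_000, 1_000_000)
    record!(m, n, work, rand(n))
end
predict_time(m, 100_000_000)   # rough runtime estimate, in seconds
```

The same fit could be run on `@allocated` instead of `@elapsed` to get the memory prediction for the scheduler, and a predicted-runtime comparison between two implementations marked as equivalent is roughly all the AutoOffload-style idea above would need.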

I personally think that tracking provenance is best done outside of the language, and the git metaphor used by DVC works well.

So many random ideas, so little time…

9 Likes

Great write-up, thanks!

There’s also dat, which had a lot of buzz in my circles a couple of years ago, but I don’t know how wide the adoption is. I first understood it as git for data, but it seems like it’s more focused on sharing now :woman_shrugging:

I use Airflow a lot and it does give me a lot of headaches; the best list I’ve found of the things that annoy me is from this one alternative tool (that I did not try):

1 Like

@oxinabox - I’m very interested in your project. I’ll grok it for inspiration, and who knows maybe there’s something in there worth borrowing.

@kolia - Nice to hear from you Kolia!

Yea, I believe you are correct in that the more modular the better. That being said, I would like it not to become so abstract that people will struggle to debug it. So centering on a few common motifs is a good idea, but if people go too far off the beaten trail they’ll get what they expect along the way (a bit of pain, but still some saved time).

The idea of computing load-cost and planning accordingly would be fantastic, and ironically, not the hardest thing to do with Julia’s inspection tools. For other languages - well, TBD. No problem to make something like this optional, though.

Yea, I think Dagger is a tool of interest here. It sounds like it’s not far from what we all want anyways. Just a little tricky to grok.

Kuber.jl is a nice find - also great to see it stable. Oh jeez… Yea, there are a lot of diverse needs, and I’m lacking a lot of background for a lot of them. I’m sure in 2 years there’ll be different needs as well. So flexibility is important; whatever the core of this is, it has to be pretty damn generic…

@kevbonham - You’re right, we need a female presence in the package-scape. A female character who orchestrates many pieces of things and makes them into something awesome - hmmm. I can think of a few, but please suggest your own :).

Hey, any SLURM experience is better than mine :).

Alright, I’m going to have to do some heavy thinking on this, and how to break it down. I’m getting glimpses of ways to go, but I need to understand some of the available tooling better. @jpsamaroo - could you chat with me a bit, from a high level, about Dagger?

Kevin already suggested a suitable name :slight_smile:

2 Likes

It can also be challenging to avoid tokenization - that is, we don’t want to name it after a female character just because we need a token female character in the ecosystem. I think one of the best things the Julia devs did after naming the language (aside from, you know, actually building it) was to make it part of the community standards that the name should not be gendered or turned into a character.

I’m probably over-emphasizing this; once one starts down this path it can get a bit crazy. And all the development and design discussions are more important than bikeshedding the name. But that’s one of the reasons I like Hapi (that’s what I named my placeholder repo I linked above) - the god was androgynous, so it sort of avoids the whole issue (and it’s on-theme, because they controlled the flow of the Nile - the Nile Delta is like a DAG too :-P).

Another way to avoid the issue entirely is to follow the Julia convention of naming packages with clear indications of what they do. AnalysisWorkflows.jl or ReproduciblePipelines.jl or something (though that’s way less fun).

10 Likes

+1 for something like AnalysisWorkflows.jl.

2 Likes

Naming can come naturally later. I’m willing to accept whatever makes people happy, I just don’t want anyone to feel left out.

I made a slack channel (#analysisworkflows) for this, and invited everyone except @fborda and @kolia (I couldn’t find your names in slack). Hopefully this lets us free-flow ideas and designs? Also, the name is not stuck; it’s just descriptive for slack for now. I don’t want people to think that the slack channel is only for this project - go in there, ask questions, do your own thing, all good. I don’t own anything :), just trying to organize. And if you don’t want to slack, that’s fine too; just post here, message, or contribute.

Is it possible to make a nonlocal github? Or do they have to be tied to a single person’s account? If so, let’s draw straws or whatever. I couldn’t care less if it was on my account; I just want to try some ideas out and learn some things.

Make a GitHub “organization” - they’re free for open source projects and one can manage individual repos and permissions. A lot of Julia projects do this - Dagger and ClusterManagers etc are in the JuliaParallel org - and it’s especially useful if we end up wanting a bunch of small packages.

1 Like

Perfect! Okay. I can do that. Can we change the name of the organization later? I don’t want to commit to anything yet, but I did start hacking away trying basic things in ClusterManagers/Distributed.

Edit: GitHub tutorials imply we can rename. I’ll kick it off with AnalysisWorkflows for now.

https://github.com/AnalysisWorkflows - kicked off. I added some of you as members; feel free to ask to be removed or added. At this point I figure we can all hack away on whatever we want. Don’t focus on making a release. Let’s learn the ecosystem and see what functions are useful and where they belong.

I made a repo called Playground. I don’t know if the MIT license should even have a copyright? If not, let’s just remove it? I don’t know laws. But anyways, expect to see some desperate screwing up in there soon by me: https://github.com/AnalysisWorkflows/Playground.

Cluster-wide workflow management and execution is IMO a specific application of distributed agent simulation. For an example, see Python’s ray module (https://github.com/ray-project/ray).

1 Like

Nice one, yea I’ve heard of Ray. It seems promising for languages like python that really need help in scaling out. Pretty sure most of what Ray offers functionality-wise is in base Julia ;). I’m not making a comment on RPC vs RDMA, specifics of which actor model is best, or other serious HPC concerns at the bleeding edge of distributed computing… I can’t comment on those things with my background. But, if there was a significant advantage to the Ray model, I bet Julia could adapt it without too much trouble given the similarities between the two.

I’m thinking what most people would like is: getting Julia to scale reproducibly, and easily having interconnected but modular pipelines across heterogeneous or homogeneous systems. That’s mostly what I want out of this. All that fancy HPC business will have to be contributed by people who don HPC tophats and suitcoats, or made from the ground up in a separate effort (whatever boats your float).

Being able to easily stand up a pipeline in Julia for ETL on one box, pipe it to a modelling box, and cache results in a DB, all while looking for weird behaviour, would be amazing for me. I think people running lots of physical simulations and large-scale analyses would also like this. A lot of domain experts aren’t framework geeks, and want to reserve their mental capacity for their domain; that’s who I’m hoping to help.

Also - Julia offers a world-class DiffEq ecosystem, and a really awesome graph analysis packagescape. Both are constantly growing, and almost too easy to use. I want to leverage these tools native to Julia’s ecosystem for fast and moderately intelligent distribution of reproducible and controllable Julia workloads. I’m guessing the basics of this won’t even be too hard, because everyone else has already done the super hard stuff - but I can’t tell yet whether I’m being overly optimistic. If I’m not, this would be a crazy showcase of the power of the language and ecosystem. If I’m wrong, I’m gonna learn a lot of stuff, hopefully with a nice small crew of people!

Most workflow frameworks assume a static (i.e. predefined) computation graph: the DAG is computed first and then traversed for execution. In Ray, which is based on the Actor model, the order of computation can depend on execution (i.e. “runtime”) events; i.e. the DAG is dynamic. In Julia, the actively developed Actors.jl framework (by Richard Palethorpe, on GitLab) may be leveraged for this.
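
To illustrate the static-vs-dynamic distinction without Ray or Actors.jl, here is a tiny sketch using nothing but plain Julia tasks (the function names are made up): which downstream node even gets created depends on a value that is only known at runtime.

```julia
# Minimal illustration of a "dynamic DAG": the next node of the graph is
# chosen only after an upstream result arrives.
load_data()   = (sleep(0.1); rand())   # stand-in for I/O
heavy_path(x) = (sleep(0.5); x^2)      # expensive branch
light_path(x) = x + 1                  # cheap branch

t1 = Threads.@spawn load_data()
x  = fetch(t1)

# A static DAG would have to pre-declare both branches; here the edge is created at runtime.
t2 = x > 0.5 ? Threads.@spawn(heavy_path(x)) : Threads.@spawn(light_path(x))
fetch(t2)
```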

2 Likes

I’m currently looking at options in this space for my team, and folks in my company who have already pored over the many options are recommending mlflow.

Looks like it might be a good fit for us after a few minutes of scratching the surface, which is a good sign. The implementation is Python, but the usage doesn’t have to be; there are Python, R, and REST APIs. A Julia API would fit right in, I would think, if we were to contribute one.
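
Just to sketch what a contributed client could look like (this is not an existing package; the endpoint paths and field names below are recalled from the MLflow REST docs and may differ between MLflow versions, so treat them as assumptions), a thin Julia wrapper is basically a handful of HTTP calls:

```julia
# Hypothetical thin client for an MLflow tracking server via its REST API.
using HTTP, JSON

const TRACKING_URI = "http://localhost:5000"   # assumption: a local mlflow server

# POST a JSON body to an MLflow endpoint and parse the JSON response.
mlflow_post(path, body) =
    JSON.parse(String(HTTP.post("$TRACKING_URI/api/2.0/mlflow/$path",
                                ["Content-Type" => "application/json"],
                                JSON.json(body)).body))

experiment = mlflow_post("experiments/create", Dict("name" => "julia-demo"))
created    = mlflow_post("runs/create",
                         Dict("experiment_id" => experiment["experiment_id"],
                              "start_time"    => round(Int, time() * 1000)))
run_id     = created["run"]["info"]["run_id"]
mlflow_post("runs/log-metric", Dict("run_id" => run_id, "key" => "rmse",
                                    "value" => 0.42,
                                    "timestamp" => round(Int, time() * 1000)))
```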

@anon92994695 could you please add me to the AnalysisWorkflows org, so I can see what you guys are up to?

1 Like

Never heard of MLFlow; looks a bit verbose. But nice find.

There are no private repos or anything, so what you see on there right now is the public view. So far I’m the only contributor. If you want to give me your GitHub information I can add you.

Yea, I think I’m going to just make a tool that does what I want it to, and cover general use cases with common tools. If you all want kubernetes + other goodies, I don’t have the means to test that. But maybe someone else will contribute and find a way. For your team/company, it sounds like you all have some very diverse needs that I’m likely not going to end up directly accommodating. So maybe MLFlow is the way to go for you all.

My early goals are just to find like-minded people, learn some things, and build a modular approach to the general problem. If people add on more - great. We’ll see; it’ll probably be slow moving. So if you have immediate needs, I wouldn’t invest in it unless you are able to contribute.

MsPhelps?

I suggest that you [all] design this believing there will be a good way to provide it with a UI down the line, without spending any real time on coding that part. Rather, let your capabilities, capacities, facilities, and related prepackaged tooling be, as much of the ecosystem is, easily utilized by others for reasons of their own. Clean, small, cooperative parts go farther than elaborate windup toys.

2 Likes

Jeffrey, that is the plan!

The goals as I see them first and foremost are

  1. establish what tools already exist in the ecosystem to leverage
  2. play with them and see what parts do what, and how they do it
  3. grab at pieces from the outside in and make a crude mockup
  4. shatter the mockup, refactor
  5. no clue yet!

I’m aiming for a truly modular design. I have some ideas, but I haven’t run into all the walls of what’s already been done yet. It will probably take a bit of time. This has been a lot of fun so far.

2 Likes

Alright, so I’ve spent a couple of hours now and have a skeleton for an API that I like, a feasible workflow, and some convenience functions. What I concluded was the following.

Multiple schemes to consider:

  1. Snakemake-like scheme: each file is handled from start to finish by a single node and all computations performed accordingly (embarrassingly parallel?).
  2. Spark-like scheme: the DAG is ‘planned’ and actions are ‘joined’ to make ‘efficient’ parallel computations. This requires introspection, a passion for headaches, and an enjoyment of the JVM.
  3. Being perfect: depending on IO size/method and compute available, resources, data, and computations are dynamically linked & driven by load. This requires introspection, dynamic optimizations, and dynamic scheduling.
  4. Something in the middle: the user constrains resources to machines, dictates minimum & maximum allowance for parallelism on a given taskset, chooses whether to memoize data transforms or pass them to new locations, and writes good code that is parallel enough. When resources free up, or a new task is emitted, workers will greedily scramble to finish that task. All tasks are of equal importance.

I’m gearing for number 4. It’s the most flexible, modular approach. It lets a user who doesn’t know better not think so much and still get a result, but if someone wants to think a bit harder they could do a lot with it. I’m not here to detail how all of this is going to happen - I have some good ideas now, but lots of screwing around to do before I even pretend like this is a thing. A rough sketch of the greedy part is below.
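
For what it’s worth, here is a sketch of just the “workers greedily scramble” part of number 4, using nothing but the Distributed stdlib (the task shapes and names are made up, and `pmap` or Dagger already give you this and much more; this is only meant to show the mechanics):

```julia
# Greedy pull model: tasks sit in a shared queue, each idle worker takes
# whatever comes next until it sees a stop sentinel.
using Distributed
addprocs(4)                                   # assumption: 4 local workers; any ClusterManager works

@everywhere run_task(t) = (sleep(t.cost); (id = t.id, result = t.cost^2))

tasks   = [(id = i, cost = rand()) for i in 1:20]
jobs    = RemoteChannel(() -> Channel{Any}(length(tasks) + nworkers()))
results = RemoteChannel(() -> Channel{Any}(length(tasks)))

foreach(t -> put!(jobs, t), tasks)
foreach(_ -> put!(jobs, nothing), workers())  # one stop sentinel per worker

for p in workers()
    remote_do(p, jobs, results) do jobs, results
        while (t = take!(jobs)) !== nothing   # greedily grab the next task
            put!(results, run_task(t))
        end
    end
end

done = [take!(results) for _ in tasks]        # gather results as they finish
```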

2 Likes

It was brought to my attention there is an awesome package out there: https://github.com/invenia/Dispatcher.jl

It’s really similar to what I was aiming to do! I think there’s still merit to exploring an alternate approach, but it’s actually really eerie to see how I made some of the same design decisions as they did. So, for a hack, I’m happy I was thinking about things similarly to people who know what they are doing :).

So I highly recommend checking out Dispatcher and Dagger. I don’t know the future of this effort, aka AnalysisWorkflows, but I think I’ll keep exploring it anyways :).

5 Likes