Is ClusterManagers.jl maintained? Or, how to do multi-node calculations in Julia?

Just wondering if anybody is maintaining ClusterManagers.jl? Major components of it have been broken for over two years now… (example). Now it seems that even the previously well-supported cluster managers are a bit sketchy (Slurm broken · Issue #196 · JuliaParallel/ClusterManagers.jl · GitHub).

Distributed compute is such an important part of scientific computing that I’m a bit worried this part of the ecosystem is in such a dismal state.

If ClusterManagers is not a reliable option, how do people typically do distributed computation in Julia? Maybe there’s some package everybody is using these days that I am not aware of…

5 Likes

I tried to find a solution for this last year, gave up after a while, and ended up using Ray, which has gone pretty smoothly.

I suspect you’ll have an easier time using Python/Ray to do all the cluster work and `subprocess.run` to launch Julia processes than you will trying to do everything in Julia.

1 Like

Thanks. Do you call Ray from Julia(/PythonCall)? Or do you mean you launch your entire workflow from Python?

Also, how do you pass Julia objects between workers? (Maybe just serializing/unserializing to a file on a shared file system?)

when in a cluster/distributed context, I try to stick to the basics and avoid any workflow that sounds even mildly complex :sweat_smile:

I launch everything from Python, but use `subprocess.run` to call the workhorse (Julia or C++). PythonCall would probably work, but I haven’t tried it.

If I need to pass an object back and forth, I serialize it to a file inside the subprocess (Julia), then read it back when the subprocess returns to the Ray (Python) worker, which can put it into the distributed object store. The serialization format has to be something both languages understand; in my case that’s parquet, but there are probably better options.

note my workflow is “embarrassingly parallel” and mostly linear, so ymmv
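To make the file handoff above concrete, here is a hedged sketch of the Julia side of such a worker, using CSV via the stdlib `DelimitedFiles` as a stand-in for parquet so the sketch needs no extra packages (the script name and output path are hypothetical; the Python/Ray driver would launch it with `subprocess.run` and read the file back):

```julia
# Hedged sketch of the Julia side of a file-based handoff.
# The driver launches this script and then reads the output file back;
# CSV (stdlib DelimitedFiles) stands in for parquet here.
using DelimitedFiles

# Compute some results and write them where the driver expects them.
function write_results(path::AbstractString)
    results = hcat(1:5, (1:5) .^ 2)   # two columns: input, input squared
    writedlm(path, results, ',')
    return path
end

# When run as a script, take the output path from the command line, e.g.
#   julia worker.jl /shared/fs/results.csv
if abspath(PROGRAM_FILE) == @__FILE__
    write_results(ARGS[1])
end
```

On the Python side, something like `pandas.read_csv` would then pick the file up from the shared filesystem.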

1 Like

Thanks, I see. Yeah I definitely need to have fairly frequent communication all-to-all between workers so sadly this wouldn’t work for me. But I will still check out ray for launching… maybe there’s a super simple way to just have every worker communicate via files. (My stuff is not high throughput, it just needs to be fairly regular and asynchronous)

I think MPI. But our infrastructure doesn’t have MPI, so most of the time the answer is we don’t…

2 Likes

Would Dagger.jl work for you?

1 Like

Dagger.jl is a high-level wrapper of Distributed.jl, so it doesn’t by itself handle any multi-node stuff (until someone fixes ClusterManagers.jl)

1 Like

Thanks; do you mean people manually make calls via MPI.jl? I checked out MPIClusterManagers.jl but it also seems pretty broken…

It’s really weird to me that the distributed computing side of Julia is this deficient. This seems like one of the main things Julia would be really good at, given the community’s focus on scientific computing… Am I missing something here? Is everybody just batch-submitting single-node Julia scripts, and are we among a small handful of people trying to do true multi-node stuff?

3 Likes

Hi Miles, I have used SlurmClusterManager.jl with 3 nodes.

But my code is not complex in parallel terms. Please post here if you find a solution.
Best
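For reference, a minimal sketch of what such a job can look like (the `#SBATCH` values are hypothetical, and this only works when run inside a Slurm allocation, so it cannot be executed standalone):

```julia
# Hedged sketch: run inside an sbatch allocation, e.g. one requested with
#   #SBATCH --nodes=3
#   #SBATCH --ntasks-per-node=4      (hypothetical values)
# SlurmManager() starts one Distributed.jl worker per Slurm task.
using Distributed
using SlurmClusterManager

addprocs(SlurmManager())

# Each worker reports where it landed:
@everywhere println("worker $(myid()) on $(gethostname())")

# A trivial distributed map over the whole allocation:
squares = pmap(x -> x^2, 1:2 * nworkers())
```

From here on it is plain Distributed.jl: `@everywhere`, `pmap`, `remotecall`, etc.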

2 Likes

there is also https://github.com/ray-project/ray/pull/40098 and https://github.com/beacon-biosignals/Ray.jl which look exciting, though I have used neither

1 Like

People who really care about distributed computing, especially involving more than just two or three nodes, use MPI. It’s the industry standard and well supported in Julia through MPI.jl. The (big) drawback is that you pretty much lose all interactivity.

As I would describe it, Distributed.jl is an attempt to enable the user to do convenient and interactive small-scale distributed computing. It is a non-standard approach in a less-explored design space. The people who created and worked on it have focused on pushing Julia forward on other fronts (e.g. @vchuravy brought us pkgimages). And attracting new people to work on it has been difficult because, among other things, it has lived in the main Julia repo. Hopefully we will see more activity when it’s a separate package.
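As a point of reference for the MPI route, a minimal MPI.jl program looks something like the sketch below. It is launched non-interactively, e.g. with `mpiexecjl -n 4 julia script.jl`, so it cannot run without an MPI installation:

```julia
# Minimal MPI.jl sketch: every rank contributes a value and an Allreduce
# sums it across all ranks. Launch with: mpiexecjl -n 4 julia script.jl
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)   # this process's id, 0-based
nranks = MPI.Comm_size(comm)   # total number of processes

# All-to-all reduction: with 4 ranks, total == 0 + 1 + 2 + 3 == 6
total = MPI.Allreduce(rank, +, comm)
println("rank $rank of $nranks sees total = $total")

MPI.Finalize()
```

This illustrates the trade-off mentioned above: the communication primitives are powerful, but everything happens inside a batch-launched script rather than an interactive session.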

4 Likes

Thanks, this is very helpful! I will switch to MPI.jl then.

I think my mistake was assuming that Distributed.jl being in the stdlib meant it was the standard Julia approach for anything involving distributed compute (whereas MPI.jl being a package meant it was an experiment).

1 Like

Argh, I guess this means you couldn’t do distributed compute from PythonCall.jl then, right? Since you have to launch Julia via mpiexecjl? Looks like I don’t really have any good options here… All I really want is to start up Julia and execute an operation over my slurm allocation. One year ago this seemed to work with ClusterManagers but now things seem to have broken :frowning_face:

ClusterManagers.jl is distinct from Distributed.jl – the latter is a standard library and quite stable (it could benefit from a dedicated maintainer, but its stability is also a boon).

ClusterManagers.jl is an external package that provides integration with diverse schedulers. It has proven challenging to maintain, partly because of the diversity of schedulers, the inherent difficulty of matching one code pattern to many targets, and the problem of providing CI for all these different schedulers. I have come to believe that ClusterManagers.jl should have ignored the question of requesting resources and instead focused on creating a Distributed.jl cluster within an allocation.

MPIClusterManagers.jl was spun out of MPI.jl; it mostly ignores the question of requesting resources and instead uses MPI within an allocation to create a Distributed.jl cluster.
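To illustrate the "cluster within an allocation" idea: once the scheduler has already granted you nodes, Distributed.jl’s built-in SSH manager can start workers on them directly, with no scheduler-specific manager at all. A hedged sketch, where the hostnames are hypothetical (on Slurm you might obtain them with `scontrol show hostnames $SLURM_NODELIST`), and which requires passwordless SSH between the nodes:

```julia
# Hedged sketch: build a Distributed.jl cluster inside an existing
# allocation using the built-in SSH manager, bypassing ClusterManagers.jl.
using Distributed

hosts = ["node001", "node002"]          # hypothetical allocation hostnames

# 4 workers per host over SSH, all using the same project environment:
addprocs([(h, 4) for h in hosts];
         exeflags = "--project=$(Base.active_project())")

@everywhere println("worker $(myid()) on $(gethostname())")
```

This separates the two concerns cleanly: the scheduler requests the resources, and Distributed.jl merely populates them.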

Additional maintainers or competing packages to ClusterManagers would be welcome, and I particularly believe that Dagger.jl (cc: @jpsamaroo) needs support for cluster schedulers.

7 Likes

I’m not sure exactly what your requirements are, but perhaps this old post of mine might be helpful:

I no longer work with SLURM, so I can’t really answer any questions beyond that. :slight_smile:

Thanks everyone. (Sorry to nitpick, but it is perhaps a bit misrepresentative that it is called “Distributed” when by itself it only handles multiprocessing on a single node, which is not what I would call distributed computing… but that’s a minor point and I digress.) Edit: I am completely wrong, see @aplavin’s helpful correction below!

@CameronBieganek I might not have been clear, but I have been using ClusterManagers and Distributed quite extensively for maybe 3 years now. They are the current distributed computing backend for PySR and SymbolicRegression.jl. I have in the past been able to get decent communication going over ~4 nodes × 128 cores with ClusterManagers.jl + Distributed.jl (your example has --nodes=1, which is why you can get away with only Distributed.jl; it doesn’t need to interact with Slurm).

It’s just that they’ve always felt a bit hacky and unreliable, with users of my libraries complaining they couldn’t get it working on PBS clusters (it turns out ClusterManagers has been broken there for two years, with no fixes on the horizon).

And now, Slurm clusters have broken too! Of course, there’s always the option to fix everything myself, but that’s not exactly what I signed up for as a user :sleepy:

1 Like

At the very least I think the README could be more honest about the fact that most of the cluster managers it claims to support don’t actually work…

With its current state, you get people like me spending weeks and weeks implementing support for it – thinking I’ll be able to support all of these types of clusters – only to find out it’s actually unmaintained and broken. The README should just be honest about this and point people to MPI.jl instead.

4 Likes

My experience with clusters and ClusterManagers.jl is that it vastly depends on who is submitting and maintaining the recipes. Clusters vary by quite a bit. Perhaps the most consistency I have seen is within the DOE supercomputers, but even then there are variations between the clusters.

This mostly works for me, but that’s because @bjarthur put together the LSF support. I can physically locate him with a short walk or a Zoom call.

My best advice for this is to get your cluster admins involved. I have to do the same thing for Dask, Spark, and Ray deployments. For the most part, cluster admins seem to appreciate being asked rather than having users improvise a solution. The main time I’ve seen this become a headache is when cluster administration has been outsourced to a third party.

1 Like

Update – made a PR here: Highlight lack of ongoing support on README by MilesCranmer · Pull Request #197 · JuliaParallel/ClusterManagers.jl · GitHub


Thanks for the tip @mkitti, I can try. I should mention that most of my concern is about downstream users rather than myself. I can always hack together a working solution for my own cluster, but I am more interested in keeping my downstream users happy and able to use their clusters for SymbolicRegression.jl searches. For every barrier in the way (such as needing to email their sysadmin, etc.), I will end up losing a significant fraction of potential users.