Is ClusterManagers.jl maintained? Or, how to do multi-node calculations in Julia?

Out of the box, Julia’s Distributed standard library does support running processes on multiple machines accessible through SSH connections. No special setup necessary.
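For anyone finding this later, a minimal sketch of what that looks like (the hostnames are hypothetical; this assumes passwordless SSH and a julia binary on the remote PATH):

```julia
using Distributed

# Hypothetical hosts; each entry is an SSH target ("user@host").
addprocs(["user@node1.example.org", "user@node2.example.org"])

# The new workers behave like any other Distributed workers:
fetch.([remotecall(gethostname, w) for w in workers()])
```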

4 Likes

Thanks, I stand corrected. I literally had no idea you could do that with addprocs


Very cool! Is this a (fairly) recent addition or maybe I somehow completely missed it when I was starting out a few years ago?

Maybe this is what everybody has been using and I’m over here pulling my hair out trying to get ClusterManagers to work…

I’m still confused about who you expect to fix this?

Without communication with a local cluster admin, I do not have any idea how to fix this or, more importantly, keep it working. Maybe I could spin up an AWS ParallelCluster configured with Slurm. Even if I get it to work there, I’m not sure it would work on your cluster.

I find it pretty suspicious that your issue report suggests that it used to work and then stopped working. What changed? I would start there.

This is potentially part of the problem. If we do not involve cluster administration and just hack up solutions that work only for us, we will not be able to establish sustainable solutions to these issues.

Some universities and institutions have dedicated Research Software Engineering or Scientific Computing groups that are meant to do just that. Please involve them. From an academic perspective, that’s really the only way resources are going to get allocated to this problem. This is ultimately how I have a job, and how I’m paid to get this sort of thing to work at my institution. I’m fortunate in that I currently do have principal investigators who give me enough freedom to pick my tools or otherwise encourage me to use Julia. In the past, I’ve been very limited in my ability to do this because I’ve had investigators insist on MATLAB. In those cases, we built entire clusters around MATLAB.

Ultimately, this is really a local problem. ClusterManagers.jl is not really doable with a single maintainer. Rather, each scheduler might have a distinct maintainer who is using it at their institution or institutions.

The other solution is of course a commercial provider. JuliaHub’s platform caters to this exact problem. Setting up and configuring clusters is hard, and JuliaHub will gladly provide you with a cluster that just works, for a fee. That seems to be their most successful business model, hence the renaming from Julia Computing to JuliaHub.

Another path is hiring a consultant. @schlichtanders 's Jolin.io, a JuliaCon sponsor, literally lists “setting up HPC clusters” as item #1 of his consultancy’s service offerings. Hiring a consultant is a bit difficult in academia for some reason, but in your case I really kind of like this idea. If you have a consultant who is familiar with your problem and software, you could refer them to your downstream users.

This ends up going the other way as well. For every sysadmin not emailed, we have one fewer sysadmin who cares about Julia. That in turn means many fewer users who will have a solution that works sustainably on their cluster.

I think that’s been around since Julia 1.0:

https://docs.julialang.org/en/v1.0/stdlib/Distributed/#Distributed.addprocs

This feature uses SSHManager, one of the two built-in cluster managers.

https://docs.julialang.org/en/v1/manual/distributed-computing/#Network-Requirements-for-LocalManager-and-SSHManager
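For completeness, the machine-spec form of addprocs exposes most of what SSHManager can do. A sketch with hypothetical hosts and paths (tunnel, exename, and dir are documented addprocs keywords):

```julia
using Distributed

# Hypothetical hosts and paths. A tuple gives a worker count per host;
# :auto launches one worker per core on that host.
machines = [("user@node1.example.org", 4), ("user@node2.example.org", :auto)]

addprocs(machines;
         tunnel  = true,                    # tunnel worker traffic over SSH
         exename = "/opt/julia/bin/julia",  # julia executable on the remote machines
         dir     = "/scratch/myproject")    # remote working directory
```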

I’m not very convinced that discovering SSHManager is actually a solution to your problem. It is just the foundation upon which cluster manager packages such as ClusterManagers.jl and SlurmClusterManager.jl build. Ultimately, all a cluster manager tries to do is start a job so that there is a remote computer to establish an SSH connection with, or otherwise set up some means of communication. The Slurm issue you reported seems to be someone changing how internode communication works on your cluster.
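To make that concrete, the ClusterManagers.jl pattern looks roughly like this (a sketch: the worker count, partition, and time limit are hypothetical, and extra keyword arguments are forwarded to the scheduler, so the exact names depend on your site and package version):

```julia
using Distributed, ClusterManagers

# Submits a Slurm job requesting 16 workers and waits for them to connect back
# to this process. Keyword values here are placeholders and site-specific.
addprocs_slurm(16; partition = "batch", t = "00:30:00")

# From here on it is ordinary Distributed code:
results = pmap(i -> (myid(), i^2), 1:32)

rmprocs(workers())
```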

5 Likes

In terms of fixes, I just think the README should be clear that ClusterManagers.jl is unmaintained/untested rather than listing the cluster software as “currently supported.” My current plan is to just try out MPI.jl, though.

I guess I just missed it, mea culpa.

For what it is worth, I did kick off this PR: [WIP] Fork `dask-jobqueue`'s testing suite images by MilesCranmer · Pull Request #193 · JuliaParallel/ClusterManagers.jl · GitHub in the hope that someone could finish it off. No progress so far, unfortunately.

The idea was to make it so that a maintainer could actually test the cluster management software across many versions, instead of being stuck with “I don’t have Slurm” issue responses and not being able to debug things.

Good point!

I mean, Python is pretty good on this front: Dask, Ray, PyTorch, JAX, and many other frameworks are excellent at this type of thing, without much cluster-specific customization needed (note that my PR is based on Dask’s CI suite – they actually made a way to unit-test many different cluster schedulers). FWIW, I’ve never needed to email my sysadmin for a PyTorch Lightning multi-node neural net training job. So it’s definitely possible.

I agree. Actually, it sounds like it is closer to zero at the moment. I really just think the README should be honest about this, though…

1 Like

For each of these frameworks, we have locally added hooks and settings, or otherwise modified our cluster to make them work. Much of this information is behind an intranet because it is deemed a security risk to expose it. These accommodations exist because someone at our institute asked us to make it work well. Perhaps the other tools work well because someone else already asked?

The next CI call is tomorrow, February 12th. The question here is whether it would make sense to integrate this into Julia’s Buildkite setup.

The next Julia HPC call is scheduled for February 27th:

Information for both can be found on the community calendar:

This seems a little unfair, as your PRs and issues are getting attention and responses. That said, I empathize, and as a “maintainer”, I apologize (TBH I didn’t know how I ended up in that position, other than that for a time I was interested in using the package and had access to a SLURM cluster).

2 Likes

Ah, didn’t realize you were the maintainer – I thought you were another person who (like me) was using it and hanging out in the issues/PRs. I certainly appreciate you taking on that role!

Yeah, I think the most urgent issue is making a test suite – [WIP] Fork `dask-jobqueue`'s testing suite images by MilesCranmer · Pull Request #193 · JuliaParallel/ClusterManagers.jl · GitHub. Once that is in, it will be easier for you and others to test what is broken. The clearest breakage right now is the PBS one from a couple of years ago, pbs error · Issue #179 · JuliaParallel/ClusterManagers.jl · GitHub, due to an API update. But I think the test suite would help with this, as it would make it clear whether things are working or not.

All that I can attest to is that we use LSFManager with some frequency, and it still seems to work.

It might actually be useful to have a column in that table for “broken/working” based on user tests/issues (until that dask-jobqueue testing suite is implemented).

I used addprocs_sge a while ago and it worked.

Perhaps, but here’s the critical question. Broken for whom or what? Just because it’s broken for you does not necessarily mean it’s broken for everyone.

1 Like

Not sure about Slurm, but PBS is definitely broken for everyone (unless they’re on an old version), since the command-line arguments are incorrect on v20+.

My main point is just that the README needs to be clearer that these cluster managers are not all “currently supported”… so people don’t repeat my mistake of assuming it is a catch-all for downstream users’ clusters.

1 Like

Guess I should stick my head above the parapet here…
In the past I have worked with PBS, Slurm and SGE.
I don’t have access to a PBS cluster at the moment, but could maybe spin up a cloud instance or use some VMs.

I think it is time I put my money where my mouth is on Julia HPC.

9 Likes

I mean, the latter is basically true, except someone gave me write access. I can give you the bit, and then you’d be as much a maintainer as me :sweat_smile:. EDIT: I see now that Valentine already offered this

Totally agree - it seemed like @vchuravy and @jling were helping out here, I’m not clear why it stalled out. @jling do you just need write access?

1 Like

I’m not a ClusterManagers maintainer, but I’m currently working on fixes and improvements for the SlurmManager. I’ll do a PR to upstream them once we’ve tested them in the field with real-life applications.

6 Likes

Ok, here are my current fixes:

seems to work well in practice so far.

Following advice from @jblaschke, I’m looking into ClusterManagers.ElasticManager as an alternative now; it uses socket communication during worker startup, instead of the main process reading a connection-info output file from each worker.
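For anyone curious, the ElasticManager pattern looks roughly like this (a rough sketch based on the ClusterManagers.jl README; the address, port, and cookie are hypothetical, and constructor/worker-entry-point signatures may differ between versions, so check the README of the version you have installed):

```julia
# On the main process (e.g. on the login or head node):
using Distributed, ClusterManagers

em = ElasticManager(addr = :auto, port = 9009, cookie = "shared-secret")

# Each batch-launched worker then dials back in over a TCP socket instead of
# writing a connection-info file, e.g. (run on the worker side):
#
#   using ClusterManagers
#   ClusterManagers.elastic_worker("shared-secret", "head-node.example.org", 9009)
```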

7 Likes

This is great! Thanks @oschulz !!!

Thanks @oschulz! After reading this thread I decided to give ClusterManagers.jl another shot (for Slurm), and through your patch I was able to see that addprocs times out if it can’t connect within 60 seconds.

I was able to fix this by launching Julia with JULIA_WORKER_TIMEOUT=150 julia, and now it’s working just fine!

4 Likes

Yes, increasing JULIA_WORKER_TIMEOUT helps. Also, the default waiting-time backoff scheme is not ideal (maybe we should change that too); at the moment I use retry_delays = [1, 1, 2, 2, 4, 5, 5, 10, 10, 10, 10, 20, 20, 20] for SlurmManager, to check for worker connection files more frequently.
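In case it helps, this is roughly how those delays get passed (a sketch assuming the patched SlurmManager accepts the retry delays directly; the worker count and srun options are hypothetical, and the exact constructor may differ in the released ClusterManagers.jl):

```julia
using Distributed, ClusterManagers

# Check for the worker connection files more frequently at first,
# instead of the default exponential backoff.
retry_delays = [1, 1, 2, 2, 4, 5, 5, 10, 10, 10, 10, 20, 20, 20]

# Assumption: SlurmManager takes the delays as its second argument;
# remaining keywords are forwarded to srun (values are site-specific).
addprocs(SlurmManager(16, retry_delays); partition = "batch")
```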

Regarding the OP’s question about the maintenance of ClusterManagers.jl: I got my PR approved and merged very quickly (thanks @kevbonham and @jling).

6 Likes