Is ClusterManagers.jl maintained? Or, how to do multi-node calculations in Julia?

Out of the box, Julia’s Distributed standard library does support running processes on multiple machines accessible through SSH connections. No special setup necessary.
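For anyone finding this later, a minimal sketch of what that looks like (the hostnames are hypothetical; this assumes passwordless SSH and a julia binary on the remote PATH):

```julia
using Distributed

# Hypothetical hosts; each entry is an SSH target ("user@host").
addprocs(["user@node1.example.org", "user@node2.example.org"])

# The new workers behave like any other Distributed workers:
fetch.([remotecall(gethostname, w) for w in workers()])
```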

4 Likes

Thanks, I stand corrected. I literally had no idea you could do that with addprocs


Very cool! Is this a (fairly) recent addition or maybe I somehow completely missed it when I was starting out a few years ago?

Maybe this is what everybody has been using and I’m over here pulling my hair out trying to get ClusterManagers to work…

I’m still confused about who you expect to fix this?

Without communication with a local cluster admin, I do not have any idea how to fix this or, more importantly, keep it working. Maybe I could spin up an AWS ParallelCluster configured with Slurm. Even if I get it to work there, I’m not sure it would work on your cluster.

I find it pretty suspicious that your issue report suggests that it used to work and then stopped working. What changed? I would start there.

This is potentially part of the problem. If we do not involve cluster administration and just hack up solutions that work only for us, we will not be able to establish sustainable solutions to these issues.

Some universities and institutions have dedicated Research Software Engineering or Scientific Computing groups that are meant to do just that. Please involve them. From an academic perspective, that’s really the only way resources are going to get allocated to this problem. This is ultimately how I have a job, and how I’m paid to get this sort of thing to work at my institution. I’m fortunate in that I currently do have principal investigators who give me enough freedom to pick my tools or otherwise encourage me to use Julia. In the past, I’ve been very limited in my ability to do this because I’ve had investigators insist on MATLAB. In those cases, we built entire clusters around MATLAB.

Ultimately, this is really a local problem. ClusterManagers.jl is not really doable with a single maintainer. Rather, each scheduler might have a distinct maintainer who is using it at their institution or institutions.

The other solution is of course a commercial provider. JuliaHub’s platform caters to this exact problem. Setting up and configuring clusters is hard, and JuliaHub will gladly provide you with a cluster that just works, for a fee. That seems to be their most successful business model, hence the renaming from Julia Computing to JuliaHub.

Another path is hiring a consultant. @schlichtanders 's Jolin.io, a JuliaCon sponsor, literally lists “setting up HPC clusters” as item #1 of his consultancy’s service offerings. Hiring a consultant is a bit difficult in academia for some reason, but in your case I really kind of like this idea. If you have a consultant who is familiar with your problem and software, you could refer them to your downstream users.

This ends up going the other way as well. For every sysadmin not emailed, we have one fewer sysadmin who cares about Julia. That in turn means many fewer users who will have a solution that works sustainably on their cluster.

I think that’s been around since Julia 1.0:

https://docs.julialang.org/en/v1.0/stdlib/Distributed/#Distributed.addprocs

This feature uses SSHManager, one of the two built-in cluster managers.

https://docs.julialang.org/en/v1/manual/distributed-computing/#Network-Requirements-for-LocalManager-and-SSHManager
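For completeness, the machine-spec form of addprocs exposes most of what SSHManager can do. A sketch with hypothetical hosts and paths (tunnel, exename, and dir are documented addprocs keywords):

```julia
using Distributed

# Hypothetical hosts and paths. A tuple gives a worker count per host;
# :auto launches one worker per core on that host.
machines = [("user@node1.example.org", 4), ("user@node2.example.org", :auto)]

addprocs(machines;
         tunnel  = true,                    # tunnel worker traffic over SSH
         exename = "/opt/julia/bin/julia",  # julia executable on the remote machines
         dir     = "/scratch/myproject")    # remote working directory
```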

I’m not very convinced that discovering SSHManager is actually a solution to your problem. It is just the foundation upon which cluster manager packages such as ClusterManagers.jl and SlurmClusterManager.jl build. Ultimately, all a cluster manager tries to do is start a job so that there is a remote computer to establish an SSH connection with, or otherwise set up some means of communication. The Slurm issue you reported seems to be someone changing how internode communication works on your cluster.
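To make that concrete, the ClusterManagers.jl pattern looks roughly like this (a sketch: the worker count, partition, and time limit are hypothetical, and extra keyword arguments are forwarded to the scheduler, so the exact names depend on your site and package version):

```julia
using Distributed, ClusterManagers

# Submits a Slurm job requesting 16 workers and waits for them to connect back
# to this process. Keyword values here are placeholders and site-specific.
addprocs_slurm(16; partition = "batch", t = "00:30:00")

# From here on it is ordinary Distributed code:
results = pmap(i -> (myid(), i^2), 1:32)

rmprocs(workers())
```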

5 Likes

In terms of fixes, I just think the README should be clear that ClusterManagers.jl is unmaintained/untested rather than listing the cluster software as “currently supported.” My current plan is to just try out MPI.jl, though.

I guess I just missed it, mea culpa.

For what it is worth, I did kick off this PR: [WIP] Fork `dask-jobqueue`'s testing suite images by MilesCranmer · Pull Request #193 · JuliaParallel/ClusterManagers.jl · GitHub in the hope that someone could finish it off. No progress so far, unfortunately.

The idea was to make it so that a maintainer could actually test the cluster management software across many versions, instead of being stuck with “I don’t have Slurm” issue responses and not being able to debug things.

Good point!

I mean, Python is pretty good on this front: Dask, Ray, PyTorch, JAX, and many other frameworks are excellent at this type of thing, without much cluster-specific customization needed (note that my PR is based on Dask’s CI suite – they actually made a way to unit-test many different cluster schedulers). FWIW, I’ve never needed to email my sysadmin for a PyTorch Lightning multi-node neural net training job. So it’s definitely possible.

I agree. Actually, it sounds like it is closer to zero at the moment. I really just think the README should be honest about this, though…

1 Like

For each of these frameworks, we have locally added hooks and settings, or otherwise modified our cluster to make them work. Much of this information is behind an intranet because it is deemed a security risk to expose it. These accommodations exist because someone at our institute asked us to make it work well. Perhaps the other tools work well because someone else already asked?

The next CI call is tomorrow, February 12th. The question here is whether it would make sense to integrate this into Julia’s Buildkite setup.

The next Julia HPC call is scheduled for February 27th:

Information for both can be found on the community calendar:

This seems a little unfair, as your PRs and issues are getting attention and responses. That said, I empathize, and as a “maintainer”, I apologize (TBH I didn’t know how I ended up in that position, other than that for a time I was interested in using the package and had access to a SLURM cluster).

2 Likes

Ah, didn’t realize you were the maintainer – I thought you were another person who (like me) was using it and hanging out in the issues/PRs. I certainly appreciate you taking on that role!

Yeah, I think the most urgent issue is making a test suite – [WIP] Fork `dask-jobqueue`'s testing suite images by MilesCranmer · Pull Request #193 · JuliaParallel/ClusterManagers.jl · GitHub. Once that is in, it will be easier for you and others to test what is broken. The clearest breakage right now is the PBS one from a couple of years ago, pbs error · Issue #179 · JuliaParallel/ClusterManagers.jl · GitHub, due to an API update. But I think the test suite would help with this, as it would make it clear whether things are working or not.

All that I can attest to is that we use LSFManager with some frequency, and it still seems to work.

It might actually be useful to have a column in that table for “broken/working” based on user tests/issues (until that dask-jobqueue testing suite is implemented).

I used addprocs_sge a while ago and it worked.

Perhaps, but here’s the critical question. Broken for whom or what? Just because it’s broken for you does not necessarily mean it’s broken for everyone.

1 Like

Not sure about Slurm, but PBS is definitely broken for everyone (unless they’re on an old version), since the command-line arguments are incorrect on v20+.

My main point is just that the README needs to be clearer that these cluster managers are not all “currently supported”… so people don’t repeat my mistake of assuming it is a catch-all for downstream users’ clusters.

1 Like

Guess I should stick my head above the parapet here…
In the past I have worked with PBS, Slurm and SGE.
I don’t have access to a PBS cluster at the moment, but could maybe spin up a cloud instance or use some VMs.

I think it is time I put my money where my mouth is on Julia HPC.

9 Likes

I mean, the latter is basically true, except someone gave me write access. I can give you the bit, and then you’d be as much a maintainer as me :sweat_smile:. EDIT: I see now that Valentine already offered this

Totally agree - it seemed like @vchuravy and @jling were helping out here, I’m not clear why it stalled out. @jling do you just need write access?

1 Like

I’m not a ClusterManagers maintainer, but I’m currently working on fixes and improvements for the SlurmManager. I’ll do a PR to upstream them once we’ve tested them in the field with real-life applications.

6 Likes

Ok, here are my current fixes:

seems to work well in practice so far.

Following advice from @jblaschke, I’m looking into ClusterManagers.ElasticManager as an alternative now; it uses socket communication during worker startup, instead of the main process reading a connection-info output file from each worker.
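For anyone curious, the ElasticManager pattern looks roughly like this (a rough sketch based on the ClusterManagers.jl README; the address, port, and cookie are hypothetical, and constructor/worker-entry-point signatures may differ between versions, so check the README of the version you have installed):

```julia
# On the main process (e.g. on the login or head node):
using Distributed, ClusterManagers

em = ElasticManager(addr = :auto, port = 9009, cookie = "shared-secret")

# Each batch-launched worker then dials back in over a TCP socket instead of
# writing a connection-info file, e.g. (run on the worker side):
#
#   using ClusterManagers
#   ClusterManagers.elastic_worker("shared-secret", "head-node.example.org", 9009)
```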

7 Likes

This is great! Thanks @oschulz !!!

Thanks @oschulz! After reading this thread I decided to give ClusterManagers.jl another shot (for Slurm), and through your patch I was able to see that addprocs times out if it can’t connect within 60 seconds.

I was able to fix this by launching Julia with JULIA_WORKER_TIMEOUT=150 julia, and now it’s working just fine!

4 Likes

Yes, increasing JULIA_WORKER_TIMEOUT helps. Also, the default waiting-time backoff scheme is not ideal (maybe we should change that too); at the moment I use retry_delays = [1, 1, 2, 2, 4, 5, 5, 10, 10, 10, 10, 20, 20, 20] for SlurmManager, to check for worker connection files more frequently.
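In case it helps, this is roughly how those delays get passed (a sketch assuming the patched SlurmManager accepts the retry delays directly; the worker count and srun options are hypothetical, and the exact constructor may differ in the released ClusterManagers.jl):

```julia
using Distributed, ClusterManagers

# Check for the worker connection files more frequently at first,
# instead of the default exponential backoff.
retry_delays = [1, 1, 2, 2, 4, 5, 5, 10, 10, 10, 10, 20, 20, 20]

# Assumption: SlurmManager takes the delays as its second argument;
# remaining keywords are forwarded to srun (values are site-specific).
addprocs(SlurmManager(16, retry_delays); partition = "batch")
```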

Regarding the OP’s question about the maintenance of ClusterManagers.jl: I got my PR approved and merged very quickly (thanks @kevbonham and @jling).

6 Likes