Failure-resilient parallel computing

Hi,

I’ve been using Julia’s parallel computing capabilities for some embarrassingly parallel computations (i.e., synchronization between workers is rare) with some success. However, given the HPC environment that I’m working with (which is cloud-based and elastic), I need it to be more resilient:

  • Nodes may fail at any time, in which case I’d like the unfinished work to either get re-assigned to another worker, or at least the other workers to finish and save their work.
  • Some nodes may not be configured correctly, although the cluster scheduler (SLURM) promises that they are present. This happens because they are spun up on demand and sometimes configuration fails, so for example SLURM may think the node is up and running but SSH may not be configured correctly. So if I’m adding workers on, say, 16 nodes via addprocs() and one of them doesn’t reply, I’d like everything to proceed on the other 15.

Is there a package to deal with this? I could probably make it work with some careful exception handling, but of course if there’s an existing solution that would be easier!

For Python, dask-distributed seems to handle this. Is there something comparable for Julia? I’m aware of DaskDistributedDispatcher.jl and have tried it briefly, but it didn’t seem very mature.

Thanks,
Bela

There is Dagger.jl. I haven’t worked with it though, so can’t say if it’s a solution to your problem but they advertise it as Dask-inspired.

@belab You ask very interesting questions! Normal MPI based parallel programs do not cope with failures. I believe there is work underway in OpenMPI to cope with node failures. When there are exascale HPC facilities with huge numbers of compute nodes yes these problems will have to be tackled.
Also look at Flock of Birds computing - though I cannot find the concept on Google.

I guess at the moment one way to cope with this would be to have a part of your script before the main task. Take the list of compute nodes which is $SLURM_JOB_NODELIST
Loop through these nodes and make sure each one responds.
Then only use those which are up. That is a bit of a hack.

You of course say some careful exception handlign - so you have thought about this.

pmap has an on_error kwarg. That should let you save your failed runs and then re-run them. Not very sophisticated but maybe enough?

Dagger.jl looks interesting for some of what dask does (although the dependency graph is actually trivial in my case), but I didn’t find anything on whether it has built-in features for handling node failures etc.

Thanks everyone for your suggestions!

Yes, it’s those kinds of “hacks” that I’ve started doing, but I was hoping there’s a more elegant solution.

I’ve used pmap’s on_error argument before, but it doesn’t solve everything I need (for example when addprocs fials), and the code I’m working on right now doesn’t use pmap.

Bela