Failure-resilient parallel computing

belab · November 23, 2018, 9:44pm

Hi,

I’ve been using Julia’s parallel computing capabilities for some embarrassingly parallel computations (i.e., synchronization between workers is rare) with some success. However, given the HPC environment that I’m working with (which is cloud-based and elastic), I need it to be more resilient:

Nodes may fail at any time, in which case I’d like the unfinished work to either get re-assigned to another worker, or at least the other workers to finish and save their work.
Some nodes may not be configured correctly, although the cluster scheduler (SLURM) promises that they are present. This happens because they are spun up on demand and sometimes configuration fails, so for example SLURM may think the node is up and running but SSH may not be configured correctly. So if I’m adding workers on, say, 16 nodes via addprocs() and one of them doesn’t reply, I’d like everything to proceed on the other 15.

Is there a package to deal with this? I could probably make it work with some careful exception handling, but of course if there’s an existing solution that would be easier!

For Python, dask-distributed seems to handle this. Is there something comparable for Julia? I’m aware of DaskDistributedDispatcher.jl and have tried it briefly, but it didn’t seem very mature.

Thanks,
Bela

Balinus · November 23, 2018, 11:06pm

There is Dagger.jl. I haven’t worked with it though, so can’t say if it’s a solution to your problem but they advertise it as Dask-inspired.

johnh · November 24, 2018, 9:07am

@belab You ask very interesting questions! Normal MPI based parallel programs do not cope with failures. I believe there is work underway in OpenMPI to cope with node failures. When there are exascale HPC facilities with huge numbers of compute nodes yes these problems will have to be tackled.
Also look at Flock of Birds computing - though I cannot find the concept on Google.

johnh · November 24, 2018, 9:11am

I guess at the moment one way to cope with this would be to have a part of your script before the main task. Take the list of compute nodes which is $SLURM_JOB_NODELIST
Loop through these nodes and make sure each one responds.
Then only use those which are up. That is a bit of a hack.

You of course say some careful exception handlign - so you have thought about this.

mauro3 · November 24, 2018, 1:04pm

pmap has an on_error kwarg. That should let you save your failed runs and then re-run them. Not very sophisticated but maybe enough?

belab · November 25, 2018, 5:09am

Dagger.jl looks interesting for some of what dask does (although the dependency graph is actually trivial in my case), but I didn’t find anything on whether it has built-in features for handling node failures etc.

belab · November 25, 2018, 5:09am

Thanks everyone for your suggestions!

Yes, it’s those kinds of “hacks” that I’ve started doing, but I was hoping there’s a more elegant solution.

I’ve used pmap’s on_error argument before, but it doesn’t solve everything I need (for example when addprocs fials), and the code I’m working on right now doesn’t use pmap.

Bela

Topic		Replies	Views
Is ClusterManagers.jl maintained? Or, how to do multi-node calculations in Julia? General Usage question , package	44	2133	July 13, 2024
How to parallel Julia on multiple nodes on HPC (slurm)? Julia at Scale question	11	3585	May 20, 2020
Julia package for workflow management General Usage package	5	469	September 18, 2023
How to get started with distributed memory parallel programming? New to Julia	3	695	June 9, 2021
Fault tolerant `pmap` when worker goes down General Usage parallel	4	1480	November 2, 2020

Failure-resilient parallel computing

Related topics