I’ve been using Julia’s parallel computing capabilities for some embarrassingly parallel computations (i.e., synchronization between workers is rare) with some success. However, given the HPC environment that I’m working with (which is cloud-based and elastic), I need it to be more resilient:
- Nodes may fail at any time, in which case I’d like the unfinished work to either get re-assigned to another worker, or at least the other workers to finish and save their work.
- Some nodes may not be configured correctly, even though the cluster scheduler (SLURM) reports them as available. Because nodes are spun up on demand, configuration sometimes fails; for example, SLURM may consider a node up and running while SSH to it is not set up correctly. So if I’m adding workers on, say, 16 nodes via addprocs() and one of them doesn’t reply, I’d like everything to proceed on the other 15.
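For the second point, the workaround I have in mind (a minimal sketch, not an existing package) is to wrap `addprocs` so each host is tried individually and a failing host is skipped. The function name `addprocs_besteffort` and the injectable `addworker` keyword are my own inventions, the latter just so the logic can be exercised without a real cluster:

```julia
using Distributed

# Best-effort addprocs: try each host individually so one bad node
# does not abort the whole call. `addworker` defaults to the real
# `addprocs` but can be swapped out for testing.
function addprocs_besteffort(nodes; addworker = hosts -> addprocs(hosts))
    pids, failed = Int[], String[]
    for node in nodes
        try
            append!(pids, addworker([node]))
        catch err
            @warn "Could not add worker on $node; skipping" exception = (err, catch_backtrace())
            push!(failed, node)
        end
    end
    return pids, failed
end
```

On a real cluster you would pass the hostnames SLURM gives you (e.g. from `scontrol show hostnames`), and probably combine this with a timeout, since an unreachable host can make `addprocs` hang rather than throw.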
Is there a package to deal with this? I could probably make it work with some careful exception handling, but of course if there’s an existing solution that would be easier!
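On the exception-handling route: for the first point, `pmap` in the `Distributed` standard library already has some of this built in via its `retry_delays` and `on_error` keywords, which retry failed tasks and let the rest of the map finish instead of aborting. A minimal sketch with a deliberately flaky function (the function and parameters here are just for illustration):

```julia
using Distributed
addprocs(2)  # local workers, just for illustration

# A task that fails transiently ~30% of the time.
@everywhere flaky(x) = rand() < 0.3 ? error("transient failure on $x") : x^2

# Retry each failed task a few times with exponential backoff; if it
# still fails after the retries, record `missing` for that element
# instead of aborting the whole pmap.
results = pmap(flaky, 1:10;
               retry_delays = ExponentialBackOff(n = 3, first_delay = 0.1),
               on_error = e -> missing)
```

As I understand it, with `retry_delays` set, a task that was running on a worker that dies should also get retried on the surviving workers, which would cover the "at least let the others finish" part; it does not, however, re-add replacement workers.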
For Python, dask-distributed seems to handle this. Is there something comparable for Julia? I’m aware of DaskDistributedDispatcher.jl and have tried it briefly, but it didn’t seem very mature.