@distributed retry

I have code that looks roughly like this:

```julia
using Distributed
@everywhere using MyModule

@sync @distributed for x in xs
    my_function!(x)
end
```

I launch this with `julia -p 4 --threads=auto myscript.jl`. `my_function!` writes a file to disk and doesn't return anything.

`my_function!` sometimes crashes due to hard-to-predict memory allocation or other issues. When that happens, I'd prefer to retry it on the same input it crashed on. The problem is that the crash usually terminates the worker, e.g.

```
Worker 4 terminated.
Unhandled Task ERROR: EOFError: read end of file
```

So putting retry logic in `my_function!` won't help. Is there a straightforward way to, e.g., assign the failed value to another worker and retry?

To be honest, I've seen better MWEs. ;) Do all workers write to the same file (that would explain a lot)?

Maybe you should analyze these issues first? In any case, you could alleviate the problem with try/catch handling in the worker.
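
A minimal sketch of that suggestion, assuming the failure surfaces as an ordinary exception inside `my_function!` rather than killing the worker process (the retry-once policy is just illustrative):

```julia
using Distributed
@everywhere using MyModule

@sync @distributed for x in xs
    try
        my_function!(x)
    catch err
        # hypothetical policy: log and retry once on the same worker
        @warn "my_function! failed, retrying" x err
        my_function!(x)
    end
end
```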

No, they all output different files.

The real situation is of course more complex: there are dozens of workers, each performing long-running jobs, and occasionally, when all of them want a lot of memory at once, one crashes. A single-worker version never crashes, and while I've already spent some time optimizing memory usage, I've never seen a real-life distributed system that doesn't have some spurious crashes and doesn't require retrying.

As an example, `Distributed.pmap` has retrying built in, so I was wondering if there's a nice way to do something similar with the `@distributed` macro.
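
For reference, the built-in retrying in `pmap` is driven by its `retry_delays` keyword; here is a minimal sketch using the `xs` and `my_function!` from the snippet above (the three-attempt backoff policy is just an example):

```julia
using Distributed
@everywhere using MyModule

# If an element fails, pmap waits per the backoff schedule and retries it;
# a dead worker is dropped from the pool, so the retry lands on a live one.
pmap(xs; retry_delays = ExponentialBackOff(n = 3)) do x
    my_function!(x)
end
```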

OK, this sounds like a design problem.

This sounds reasonable: in a large enough cluster some node could randomly fail.

OK, here is the source: `@distributed` seems to use `pfor`, which has some error monitoring, but apparently not the same retry logic as `pmap`.
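
So, absent retry support in `@distributed`, the closest workaround is probably switching to `pmap`. A sketch that retries only the worker-death case, assuming a killed worker surfaces as a `ProcessExitedException` (`retry_check` is forwarded to the `check` argument of `Base.retry`):

```julia
using Distributed
@everywhere using MyModule

# Retry an element up to 3 times, but only if its worker process died;
# any other exception aborts the pmap as usual.
pmap(xs;
     retry_delays = ExponentialBackOff(n = 3),
     retry_check = (state, ex) -> ex isa ProcessExitedException) do x
    my_function!(x)
end
```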