@distributed retry

I have code that looks roughly like this:

```julia
using Distributed
@everywhere using MyModule

@sync @distributed for x in xs
    my_function!(x)
end
```

I launch this with `julia -p 4 --threads=auto myscript.jl`. `my_function!` writes a file to disk and doesn't return anything.

`my_function!` sometimes crashes due to hard-to-predict memory allocation or other issues. When that happens, I'd prefer to retry it on the same input it crashed on. The problem is that the crash usually terminates the worker, e.g.

```
Worker 4 terminated.
Unhandled Task ERROR: EOFError: read end of file
```

So putting retry logic in `my_function!` won't help. Is there a straightforward way to, e.g., assign the failed value to another worker and retry?

To be honest, I've seen better MWEs. ;) Do all workers write to the same file (that would explain a lot)?

Maybe you should analyze these issues first? In any case, you could alleviate the problem with try/catch handling in the worker.
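
A minimal sketch of that suggestion, assuming the failure surfaces as an ordinary exception inside `my_function!` rather than killing the worker process (the retry-once policy is just illustrative):

```julia
using Distributed
@everywhere using MyModule

@sync @distributed for x in xs
    try
        my_function!(x)
    catch err
        # hypothetical policy: log and retry once on the same worker
        @warn "my_function! failed, retrying" x err
        my_function!(x)
    end
end
```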

No, they all output different files.

The real situation is of course more complex: there are dozens of workers, each performing long-running jobs, and occasionally, when all of them want a lot of memory at once, one crashes. A single-worker version never crashes, and while I've already spent some time optimizing memory usage, I've never seen a real-life distributed system that doesn't have some spurious crashes and doesn't require retrying.

As an example, `Distributed.pmap` has retrying built in, so I was wondering if there's a nice way to do something similar with the `@distributed` macro.
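
For reference, the built-in retrying in `pmap` is driven by its `retry_delays` keyword; here is a minimal sketch using the `xs` and `my_function!` from the snippet above (the three-attempt backoff policy is just an example):

```julia
using Distributed
@everywhere using MyModule

# If an element fails, pmap waits per the backoff schedule and retries it;
# a dead worker is dropped from the pool, so the retry lands on a live one.
pmap(xs; retry_delays = ExponentialBackOff(n = 3)) do x
    my_function!(x)
end
```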

OK, this sounds like a design problem.

This sounds reasonable: in a large enough cluster some node could randomly fail.

OK, here is the source: `@distributed` seems to use `pfor`, which has some error monitoring, but apparently not the same retry logic as `pmap`.
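
So, absent retry support in `@distributed`, the closest workaround is probably switching to `pmap`. A sketch that retries only the worker-death case, assuming a killed worker surfaces as a `ProcessExitedException` (`retry_check` is forwarded to the `check` argument of `Base.retry`):

```julia
using Distributed
@everywhere using MyModule

# Retry an element up to 3 times, but only if its worker process died;
# any other exception aborts the pmap as usual.
pmap(xs;
     retry_delays = ExponentialBackOff(n = 3),
     retry_check = (state, ex) -> ex isa ProcessExitedException) do x
    my_function!(x)
end
```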