using Distributed
@everywhere using MyModule
@sync @distributed for x in xs
my_function!(x)
end
I launch this with julia -p 4 --threads=auto myscript.jl. my_function! writes a file to disk and doesn't return anything.
my_function! sometimes crashes due to hard-to-predict memory allocation or other issues. When that happens, I'd prefer to retry it on the same input. The problem is that the crash usually kills the worker process, e.g.
Worker 4 terminated.
Unhandled Task ERROR: EOFError: read end of file
So putting retry logic inside my_function! won't help, since the process running it is gone. Is there a straightforward way to reassign the failed input to another worker and retry?
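One approach I've sketched is to bypass @distributed, catch ProcessExitedException myself, and resubmit the input to the worker pool. To be clear, robust_map is my own made-up helper, not a Distributed API, and I haven't tested it against real worker deaths:

```julia
using Distributed

# Hypothetical helper (not part of Distributed): map f over xs via a worker
# pool, resubmitting any input whose worker process died mid-call.
function robust_map(f, xs; max_retries = 2)
    pool = WorkerPool(workers())
    asyncmap(xs) do x
        for attempt in 0:max_retries
            try
                # Takes an available worker from the pool, runs f(x) there,
                # and returns the worker to the pool when done.
                return remotecall_fetch(f, pool, x)
            catch err
                # ProcessExitedException signals the worker process died;
                # any other error (a real bug in f) is rethrown immediately.
                (err isa ProcessExitedException && attempt < max_retries) || rethrow()
            end
        end
    end
end
```

I'd call it as robust_map(my_function!, xs), but I'm not confident the pool handles dead workers cleanly, which is part of why I'm asking.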
The real situation, of course, is more complex: there are dozens of workers, each performing long-running jobs, and occasionally, when all of them want a lot of memory at once, one gets killed. A single-worker run never crashes, and while I've already spent some time optimizing memory usage, I've never seen a real-world distributed system that doesn't have some spurious crashes and require retrying.
As an example, Distributed.pmap has retrying built in (via its retry_delays and retry_check keyword arguments), so I was wondering whether there's a nice way to do something similar with the @distributed macro.
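For reference, this is the pmap pattern I'm comparing against. Here flaky is a stand-in I wrote for my_function!: it fails the first time it sees each input to simulate a transient crash, then succeeds on retry (this demo runs on the master process with no workers added, so the shared Dict works; with real workers the state would not be shared):

```julia
using Distributed

# Stand-in for my_function!: fails on the first attempt per input to
# simulate a transient crash, then succeeds when retried.
const attempts = Dict{Int,Int}()
function flaky(x)
    attempts[x] = get(attempts, x, 0) + 1
    attempts[x] == 1 && error("transient failure on input $x")
    return x^2
end

# pmap retries a failed input automatically; ExponentialBackOff supplies
# up to 3 retry delays, starting at 50 ms by default.
results = pmap(flaky, 1:4; retry_delays = ExponentialBackOff(n = 3))
```

Every input fails once and then succeeds on retry, so results comes back complete. I'd like the same behavior, but for inputs whose worker died rather than merely threw.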