Fault-tolerant `pmap` when a worker goes down

parallel
#1

Is there a way to make pmap fault tolerant to procs going down? This example shows that the on_error option does not capture this:

julia> using Distributed

julia> addprocs(2)
2-element Array{Int64,1}:
 2
 3

julia> pmap(x->(sleep(10); 0), 1:3, on_error=x->-1)
      From worker 2:
      From worker 2:    signal (15): TerminatedWorker 2 terminated.

      From worker 2:    in expression starting at no file:0ERROR:
ProcessExitedException()      From worker 2:    epoll_pwait at /usr/lib/libc.so.6 (unknown line)
Stacktrace:
....

This error is triggered when one of the worker processes is killed while the pmap is running.

This closed issue https://github.com/JuliaLang/julia/issues/217 suggests to me that it should work. Am I missing something?


#2

Answering myself: pass retry_delays, as in pmap(x->(sleep(10); 0), 1:3, on_error=x->-1, retry_delays = zeros(3)). Then, when one worker is killed, another worker picks up its work. And apparently, when all worker processes are killed, process 1 does the work itself.
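
For reference, here is a minimal sketch of that call wrapped in a small helper. The name tolerant_pmap and its fallback and retries keywords are made up for illustration and are not part of Distributed; only pmap, on_error and retry_delays come from the library. As far as I can tell, on_error only catches exceptions thrown by the mapped function on the worker, while retry_delays re-submits a remote call that failed as a whole (for example because the worker exited), which would explain why on_error alone was not enough.

```julia
using Distributed

# Start two local workers if none are attached yet.
nprocs() == 1 && addprocs(2)

# Hypothetical helper: the name and keyword defaults are illustrative only.
function tolerant_pmap(f, xs; fallback = -1, retries = 3)
    pmap(f, xs;
         on_error = e -> fallback,       # replaces an exception thrown by f with `fallback`
         retry_delays = zeros(retries))  # up to `retries` immediate retries of a failed call
end

# Same workload as above, with a shorter sleep.
results = tolerant_pmap(x -> (sleep(1); 0), 1:3)
```

To try the failure case, use a longer sleep and kill one of the worker processes from a shell while the call is running. An exception thrown by the function itself is not retried here, because on_error already turns it into the fallback value on the worker.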
