Unhandled task error using pmap

Hi,

I’m using pmap to distribute a computation across many workers on a cluster, which should be pretty straightforward. Here I am using 1 node and 10 processors. Most runs go fine, but sometimes I get an error like the one shown below. I’m not sure what this error means. Is an error occurring on these workers (i.e. 8 and 4) that I’m not seeing? Or could something be happening at the level of the cluster manager? The other processes finish just fine.

Thanks!

Worker 8 terminated.
UNHANDLED TASK ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#645")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:892
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:900
 [3] unsafe_read
   @ ./io.jl:724 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:723
 [5] read!
   @ ./io.jl:725 [inlined]
 [6] deserialize_hdr_raw
   @ /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:165
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:126
 [9] (::Distributed.var"#99#100"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ./task.jl:423
Worker 4 terminated.
UNHANDLED TASK ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#645")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:892
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:900
 [3] unsafe_read
   @ ./io.jl:724 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:723
 [5] read!
   @ ./io.jl:725 [inlined]
 [6] deserialize_hdr_raw
   @ /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:165
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:126
 [9] (::Distributed.var"#99#100"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ./task.jl:423
ERROR: LoadError: ProcessExitedException(4)
Stacktrace:
  [1] (::Base.var"#892#894")(x::Task)
    @ Base ./asyncmap.jl:177
  [2] foreach(f::Base.var"#892#894", itr::Vector{Any})
    @ Base ./abstractarray.jl:2694
  [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Matrix{Tuple{Int64, Int64}})
    @ Base ./asyncmap.jl:177
  [4] wrap_n_exec_twice
    @ ./asyncmap.jl:153 [inlined]
  [5] #async_usemap#877
    @ ./asyncmap.jl:103 [inlined]
  [6] #asyncmap#876
    @ ./asyncmap.jl:81 [inlined]
  [7] pmap(f::Function, p::WorkerPool, c::Matrix{Tuple{Int64, Int64}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
    @ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:126
  [8] pmap(f::Function, p::WorkerPool, c::Matrix{Tuple{Int64, Int64}})
    @ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:101
  [9] pmap(f::Function, c::Matrix{Tuple{Int64, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:156
 [10] pmap(f::Function, c::Matrix{Tuple{Int64, Int64}})
    @ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:156
 [11] top-level scope
    @ /storage/work/j/user/Julia/Potential_v2/bin/phase_diagram_mesh_tol.jl:24
in expression starting at /storage/work/user/Julia/Potential_v2/bin/phase_diagram_mesh_tol.jl:24
┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [2, 3, 5, 6, 7, 9, 10, 11] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1249
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1045

I get a similar EOFError from process_tcp_streams.

@poopsilon Have you found a solution, or figured out what is going on?