Hi,
I’m using pmap to distribute a computation across many workers on a cluster, which should be pretty straightforward. Here I am using 1 node and 10 processors. Most runs finish fine, but sometimes I get an error like the one shown below. I’m not sure what this error means. Is an error occurring on these workers (i.e. 8 and 4) that I’m not seeing? Or could something be happening at the level of the cluster manager? The other processes finish just fine.
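For context, here is a minimal sketch of the kind of call involved, not my actual script. The function `work`, the wrapper `safe_work`, the grid size, and the use of 2 local workers are all placeholders for illustration; the only thing taken from the real job is that `pmap` receives a `Matrix{Tuple{Int64, Int64}}`. The wrapper returns any exception thrown inside the function as a value, which is one way I could check whether a worker is throwing a Julia-level error that I'm not seeing.

```julia
using Distributed

# Placeholder setup: 2 local workers here; the real job uses 10 on one node.
addprocs(2)

# `work` is a hypothetical stand-in for the real per-point computation.
@everywhere work(ij::Tuple{Int,Int}) = ij[1] + ij[2]

# Return an exception thrown inside `work` as a value instead of letting it
# propagate out of pmap.
@everywhere function safe_work(ij)
    try
        return work(ij)
    catch err
        return err
    end
end

# Same container type as in the stack trace: Matrix{Tuple{Int64, Int64}}.
grid = [(i, j) for i in 1:3, j in 1:4]
results = pmap(safe_work, grid)

# Indices of grid points whose computation threw an exception.
failed = findall(r -> r isa Exception, results)
println("points that threw: ", failed)

rmprocs(workers())
```

A wrapper like this only sees exceptions raised inside the function itself. If the worker process is killed from outside, the `try`/`catch` never runs, so this would not explain a terminated worker on its own.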
Thanks!
Worker 8 terminated.
UNHANDLED TASK ERROR: EOFError: read end of file
Stacktrace:
[1] (::Base.var"#wait_locked#645")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
@ Base ./stream.jl:892
[2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
@ Base ./stream.jl:900
[3] unsafe_read
@ ./io.jl:724 [inlined]
[4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
@ Base ./io.jl:723
[5] read!
@ ./io.jl:725 [inlined]
[6] deserialize_hdr_raw
@ /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/messages.jl:167 [inlined]
[7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:165
[8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:126
[9] (::Distributed.var"#99#100"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
@ Distributed ./task.jl:423
Worker 4 terminated.
UNHANDLED TASK ERROR: EOFError: read end of file
Stacktrace:
[1] (::Base.var"#wait_locked#645")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
@ Base ./stream.jl:892
[2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
@ Base ./stream.jl:900
[3] unsafe_read
@ ./io.jl:724 [inlined]
[4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
@ Base ./io.jl:723
[5] read!
@ ./io.jl:725 [inlined]
[6] deserialize_hdr_raw
@ /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/messages.jl:167 [inlined]
[7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:165
[8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:126
[9] (::Distributed.var"#99#100"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
@ Distributed ./task.jl:423
ERROR: LoadError: ProcessExitedException(4)
Stacktrace:
[1] (::Base.var"#892#894")(x::Task)
@ Base ./asyncmap.jl:177
[2] foreach(f::Base.var"#892#894", itr::Vector{Any})
@ Base ./abstractarray.jl:2694
[3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Matrix{Tuple{Int64, Int64}})
@ Base ./asyncmap.jl:177
[4] wrap_n_exec_twice
@ ./asyncmap.jl:153 [inlined]
[5] #async_usemap#877
@ ./asyncmap.jl:103 [inlined]
[6] #asyncmap#876
@ ./asyncmap.jl:81 [inlined]
[7] pmap(f::Function, p::WorkerPool, c::Matrix{Tuple{Int64, Int64}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
@ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:126
[8] pmap(f::Function, p::WorkerPool, c::Matrix{Tuple{Int64, Int64}})
@ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:101
[9] pmap(f::Function, c::Matrix{Tuple{Int64, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:156
[10] pmap(f::Function, c::Matrix{Tuple{Int64, Int64}})
@ Distributed /gpfs/group/RISE/sw7/julia-1.7.0/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/pmap.jl:156
[11] top-level scope
@ /storage/work/j/user/Julia/Potential_v2/bin/phase_diagram_mesh_tol.jl:24
in expression starting at /storage/work/user/Julia/Potential_v2/bin/phase_diagram_mesh_tol.jl:24
┌ Warning: Forcibly interrupting busy workers
│ exception = rmprocs: pids [2, 3, 5, 6, 7, 9, 10, 11] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1249
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1045