I have an old script that I have been using for a while and that relies heavily on Distributed. Recently (I guess after Julia 1.11.1) the script has started crashing with the following output from every node:
Worker 55 terminated.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#832")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:972
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:978
  [4] unsafe_read
    @ ./io.jl:891 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:890
  [6] read!
    @ ./io.jl:895 [inlined]
  [7] deserialize_hdr_raw
    @ ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:121
The crash seems to happen at a random iteration, and the last location in my own code that the error points to is the @sync @distributed for-loop.
Is there any way I can dig a little deeper into where things are going wrong? As of now I really don’t know where to start looking…
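One thing that might help narrow this down, as a minimal sketch rather than a confirmed diagnosis: wrap the body of the @sync @distributed loop in try/catch so that any exception thrown inside an iteration is logged together with the worker id and the iteration index. Here `do_work`, the two local workers, and the range `1:8` are hypothetical stand-ins, not the real script.

```julia
using Distributed

# Assumption: two local workers stand in for the real cluster.
nprocs() == 1 && addprocs(2)

# Hypothetical stand-in for the real loop body, defined on every process.
@everywhere function do_work(i)
    i == 3 && error("simulated failure in iteration $i")
    return i^2
end

@sync @distributed for i in 1:8
    try
        do_work(i)
    catch err
        # Log which worker and which iteration failed, including the backtrace.
        # Worker log output is forwarded to the master process.
        @error "iteration failed" worker = myid() iteration = i exception = (err, catch_backtrace())
    end
end
```

This only surfaces exceptions raised inside the loop body. If a worker process is killed from outside (for example by the operating system), nothing is thrown inside the loop, and the other processes may only see the connection drop, which could look like the ECONNRESET above.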