I have an old script that I have been using for a while and that relies heavily on Distributed. Recently (I think since Julia 1.11.1) it has started crashing, with the following output from every node:
Worker 55 terminated.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#832")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:972
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:978
  [4] unsafe_read
    @ ./io.jl:891 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:890
  [6] read!
    @ ./io.jl:895 [inlined]
  [7] deserialize_hdr_raw
    @ ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:121
The crash seems to happen at a random iteration, and the last line of my own code that the error output points to is the @sync @distributed for loop.
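For context, here is a minimal sketch of the pattern in question. It is not my real code: `process_item`, `logged_item`, the two local workers, and the range `1:10` are all hypothetical stand-ins. Each iteration is wrapped in a try/catch so that an exception thrown by the work itself would be reported with its iteration index and worker id before being rethrown.

```julia
using Distributed

# Hypothetical setup: two local workers stand in for the real cluster.
addprocs(2)

# Hypothetical stand-in for the real per-iteration work.
@everywhere process_item(i) = i^2

# Wrap one iteration so that a Julia exception is logged together with
# the iteration index and the id of the worker that ran it, then rethrown.
@everywhere function logged_item(i)
    try
        return process_item(i)
    catch err
        @error "iteration failed" iteration = i worker = myid() exception = (err, catch_backtrace())
        rethrow()
    end
end

# Same loop shape as in my script: no reducer, so @sync is what makes
# the caller wait for all workers to finish.
@sync @distributed for i in 1:10
    logged_item(i)
end
```

Note that this would only catch exceptions raised inside Julia code. If a worker process itself dies (which "Worker 55 terminated" may suggest), the catch block presumably never runs and nothing extra is logged.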
Is there any way I can dig a little deeper into where things are going wrong? As of now I really don’t know where to start looking…