Worker is terminated with "connection reset by peer" after activating environment and using packages

I’m working on a Windows laptop, running Julia in VSCode. I run the following code to start three worker processes and load the packages on all of them:

```julia
using Distributed
addprocs(3)
@everywhere path = "C:/Users/User/Documents/bounds"

@everywhere using Pkg
@everywhere Pkg.activate(path)
using BSON
@everywhere using JuMP
@everywhere using Ipopt
```

I did quite a few runs with no problems, but the last few times I tried to run the script, one or more of the workers terminated with an "Unhandled Task ERROR" / "connection reset by peer" message. I read somewhere that this can happen when one of the processes runs out of memory, but I get the error even when I run nothing beyond the setup code above, which only activates the environment and loads packages.
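
To check the memory hypothesis rather than guess, one could print each process's peak memory use right after the setup code. This is a sketch that uses only `Distributed` and `Base.Sys`; it assumes a fresh session (it starts three workers only if none exist yet). Running it after a setup that succeeds shows how much memory each process uses and how much headroom the machine has; modest numbers would make memory exhaustion during package loading a less likely explanation.

```julia
using Distributed

# Start three workers if this session doesn't have any yet
# (nprocs() == 1 means only the master process exists).
nprocs() == 1 && addprocs(3)

# Peak resident memory of each process (master included), in MiB.
# Sys.maxrss is evaluated on the target process via remotecall_fetch.
for p in procs()
    rss_mib = remotecall_fetch(Sys.maxrss, p) / 2^20
    println("process $p: peak RSS ", round(rss_mib; digits = 1), " MiB")
end

# Free physical memory is machine-wide, so checking it on the master suffices.
println("free system memory: ", round(Sys.free_memory() / 2^30; digits = 2), " GiB")
```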

The error printout is shown below:

```
PS C:\Users\User\Documents\bounds> julia .\src\mp_test.jl
  Activating project at `C:\Users\User\Documents\bounds`
      From worker 4:      Activating project at `C:\Users\User\Documents\bounds`
      From worker 2:      Activating project at `C:\Users\User\Documents\bounds`
      From worker 3:      Activating project at `C:\Users\User\Documents\bounds`
Worker 3 terminated.
Worker 4 terminated.Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base .\stream.jl:410
  [2] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base .\stream.jl:947
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base .\stream.jl:953
  [4] unsafe_read
    @ .\io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base .\io.jl:758
  [6] read!
    @ .\io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ C:\Users\User\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\Users\User\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\Users\User\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed .\task.jl:484
┌ Error: Error during package callback
│   exception =
│    1-element ExceptionStack:
│    ProcessExitedException(2)
│
│    ...and 2 more exceptions.
│
│    Stacktrace:
│      [1] sync_end(c::Channel{Any})
│        @ Base .\task.jl:436
│      [2] macro expansion
│        @ .\task.jl:455 [inlined]
│      [3] _require_callback(mod::Base.PkgId)
│        @ Distributed C:\Users\User\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\Distributed.jl:77
│      [4] #invokelatest#2
│        @ .\essentials.jl:729 [inlined]
│      [5] invokelatest
│        @ .\essentials.jl:726 [inlined]
│      [6] run_package_callbacks(modkey::Base.PkgId)
│        @ Base .\loading.jl:869
│      [7] _require_prelocked(uuidkey::Base.PkgId)
│        @ Base .\loading.jl:1206
│      [8] macro expansion
│        @ .\loading.jl:1180 [inlined]
│      [9] macro expansion
│        @ .\lock.jl:223 [inlined]
│     [10] require(into::Module, mod::Symbol)
│        @ Base .\loading.jl:1144
│     [11] top-level scope
│        @ C:\Users\User\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\macros.jl:200
│     [12] include(mod::Module, _path::String)
│        @ Base .\Base.jl:419
│     [13] exec_options(opts::Base.JLOptions)
│        @ Base .\client.jl:303
│     [14] _start()
│        @ Base .\client.jl:522
└ @ Base loading.jl:874

Worker 2 terminated.Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base .\stream.jl:410
  [2] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base .\stream.jl:947
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base .\stream.jl:953
  [4] unsafe_read
    @ .\io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base .\io.jl:758
  [6] read!
    @ .\io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ C:\Users\User\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\Users\User\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed C:\Users\User\AppData\Local\Programs\Julia-1.8.2\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed .\task.jl:484
```

Does anyone know what might be causing this error?

Unfortunately, errors like these are hard to track down and resolve without a reproducible minimal working example (MWE). If you can reproduce this on another system with the same code or the same setup, that might make it easier to investigate.

Do you happen to know which line of mp_test.jl throws this error, or about how long it takes before it occurs?
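
Printing the elapsed time and the surviving workers after each setup step would answer both questions. A minimal sketch, where `t0` and `report` are hypothetical helper names and `LinearAlgebra` is only a stand-in so the snippet runs without the project's packages; swap in `@everywhere using JuMP` and `@everywhere using Ipopt` when testing the real environment. The last report printed, or the exception raised by the `@everywhere` line that kills a worker, points to the failing step.

```julia
using Distributed

# Start three workers if this session doesn't have any yet.
nprocs() == 1 && addprocs(3)

# Hypothetical helper: print seconds since `t0` and the workers still alive,
# so a crash can be tied to a specific setup line and a point in time.
const t0 = time()
report(step) = println(rpad(step, 26), round(time() - t0; digits = 1), " s   workers alive: ", workers())

report("after addprocs")
@everywhere using Pkg
report("after using Pkg")
# Stand-in package; replace with the JuMP / Ipopt lines from the question.
@everywhere using LinearAlgebra
report("after using LinearAlgebra")
```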

After restarting my computer, I have not seen the error again. I’m not sure why it happened in the first place, but it seems like it was a one-off thing and it’s working just fine now.

Side comment: since Julia 1.9, local workers should inherit the master's package environment (PR that fixed the issue), so you no longer need to call Pkg.activate on each worker, as long as the desired environment is already active on the master process (i.e. the active REPL process).
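
The traceback above shows Julia 1.8.2, where that inheritance does not apply. On older versions, a common alternative is to pass the environment on the worker command line through the `exeflags` keyword of `addprocs`; on 1.9 or later a plain `addprocs(3)` should be enough when the master already has the project active. A sketch that reuses the path from the question (replace it with your own) and then prints the project each process ended up with, so the result can be checked on either version:

```julia
using Distributed

# Project directory from the question; replace with your own.
path = "C:/Users/User/Documents/bounds"

# --project=<dir> makes each worker start with that environment active,
# the command-line counterpart of calling Pkg.activate(path) on every worker.
# On Julia 1.9+ a plain addprocs(3) should already inherit the master's project.
addprocs(3; exeflags = "--project=$path")

# Compare the environment each process ended up with.
println("master:   ", Base.active_project())
for w in workers()
    println("worker $w: ", remotecall_fetch(Base.active_project, w))
end

# Packages can then be loaded as before, e.g.
# @everywhere using JuMP, Ipopt
```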

BTW, I saw that you only joined recently, so welcome to the Julia Discourse :slight_smile:


That’s very good to know. Thank you!