`addprocs` crashes with `connection refused (ECONNREFUSED)`

I SSH'd into a machine that is, as far as I know, not virtualized, and ran

JULIA_EXCLUSIVE=1 julia-1.6.2/bin/julia --threads 128 langevin/langevin.jl &> out.txt &

and it crashed with the output below.

`langevin.jl` starts with

ENV["JULIA_DEBUG"] = "all"
using Distributed
addprocs(128)

Output (`out.txt`):

┌ Debug: Creating new cache for "/home/weissmann/.julia/environments/v1.6/Project.toml"
└ @ Base loading.jl:225
Worker 2 terminated.
Worker 3 terminated.
Worker 4 terminated.
Worker 5 terminated.
Worker 6 terminated.
Worker 7 terminated.
Worker 8 terminated.
Worker 9 terminated.
Worker 10 terminated.
Worker 11 terminated.
Worker 12 terminated.
Worker 13 terminated.
Worker 14 terminated.
Worker 15 terminated.
Worker 16 terminated.
Worker 17 terminated.
Worker 18 terminated.
Worker 19 terminated.
Worker 20 terminated.
Worker 21 terminated.
Worker 22 terminated.
Worker 23 terminated.
Worker 24 terminated.
Worker 25 terminated.
Worker 26 terminated.
Worker 27 terminated.
Worker 28 terminated.
Worker 29 terminated.
Worker 30 terminated.
Worker 31 terminated.
Worker 32 terminated.
Worker 33 terminated.
Worker 34 terminated.
Worker 35 terminated.
Worker 36 terminated.
Worker 37 terminated.
Worker 38 terminated.
Worker 39 terminated.
Worker 40 terminated.
Worker 41 terminated.
Worker 42 terminated.
Worker 43 terminated.
ERROR: LoadError: TaskFailedException

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1082
     [2] worker_from_id
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1079 [inlined]
     [3] #remote_do#154
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [4] remote_do
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [5] kill
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:675 [inlined]
     [6] create_worker(manager::Distributed.LocalManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:593
     [7] setup_launched_worker(manager::Distributed.LocalManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [8] (::Distributed.var"#41#44"{Distributed.LocalManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:411
    
    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:532
     [2] connect
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:639
     [4] connect(manager::Distributed.LocalManager, pid::Int64, config::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:566
     [5] create_worker(manager::Distributed.LocalManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:589
     [6] setup_launched_worker(manager::Distributed.LocalManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [7] (::Distributed.var"#41#44"{Distributed.LocalManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:411

...and 85 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:369
 [2] macro expansion
   @ ./task.jl:388 [inlined]
 [3] addprocs_locked(manager::Distributed.LocalManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:480
 [4] addprocs_locked
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:451 [inlined]
 [5] addprocs(manager::Distributed.LocalManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [6] addprocs
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:438 [inlined]
 [7] #addprocs#245
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:443 [inlined]
 [8] addprocs(np::Int64)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:441
 [9] top-level scope
   @ ~/langevin/langevin.jl:3
in expression starting at /home/weissmann/langevin/langevin.jl:3
┌ Warning: Forcibly interrupting busy workers
│   exception = IOError: stream is closed or unusable
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1242
┌ Error: Unable to terminate all workers
│   exception =
│    IOError: stream is closed or unusable
│    Stacktrace:
│      [1] check_open
│        @ ./stream.jl:386 [inlined]
│      [2] uv_write_async(s::Sockets.TCPSocket, p::Ptr{UInt8}, n::UInt64)
│        @ Base ./stream.jl:1018
│      [3] uv_write(s::Sockets.TCPSocket, p::Ptr{UInt8}, n::UInt64)
│        @ Base ./stream.jl:981
│      [4] uv_write
│        @ ./stream.jl:977 [inlined]
│      [5] flush(s::Sockets.TCPSocket)
│        @ Base ./stream.jl:1073
│      [6] send_msg_(w::Distributed.Worker, header::Distributed.MsgHeader, msg::Distributed.RemoteDoMsg, now::Bool)
│        @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:180
│      [7] send_msg
│        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:122 [inlined]
│      [8] #remote_do#153
│        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:461 [inlined]
│      [9] remote_do
│        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:461 [inlined]
│     [10] #remote_do#154
│        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
│     [11] remote_do
│        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
│     [12] kill(manager::Distributed.LocalManager, pid::Int64, config::WorkerConfig)
│        @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:675
│     [13] _rmprocs(pids::Vector{Int64}, waitfor::Float64)
│        @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1043
│     [14] rmprocs(pids::Vector{Int64}; waitfor::Float64)
│        @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1026
│     [15] terminate_all_workers()
│        @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1246
│     [16] _atexit()
│        @ Base ./initdefs.jl:343
│     [17] exit
│        @ ./initdefs.jl:28 [inlined]
│     [18] exec_options(opts::Base.JLOptions)
│        @ Base ./client.jl:289
│     [19] _start()
│        @ Base ./client.jl:485
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1248
      From worker 2:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 2:	exiting.
      From worker 3:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 3:	exiting.
      From worker 4:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 4:	exiting.
      From worker 5:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 5:	exiting.
      From worker 6:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 6:	exiting.
      From worker 7:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 7:	exiting.
      From worker 8:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 8:	exiting.
      From worker 9:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 9:	exiting.
      From worker 10:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 10:	exiting.
      From worker 11:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 11:	exiting.
      From worker 12:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 12:	exiting.
      From worker 13:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 13:	exiting.
      From worker 14:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 14:	exiting.
      From worker 15:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 15:	exiting.
      From worker 16:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 16:	exiting.
      From worker 17:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 17:	exiting.
      From worker 18:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 18:	exiting.
      From worker 19:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 19:	exiting.
      From worker 20:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 20:	exiting.
      From worker 21:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 21:	exiting.
      From worker 22:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 22:	exiting.
      From worker 23:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 23:	exiting.
      From worker 24:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 24:	exiting.
      From worker 25:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 25:	exiting.
      From worker 26:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 26:	exiting.
      From worker 27:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 27:	exiting.
      From worker 28:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 28:	exiting.
      From worker 29:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 29:	exiting.
      From worker 30:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 30:	exiting.
      From worker 31:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 31:	exiting.
      From worker 32:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 32:	exiting.
      From worker 33:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 33:	exiting.
      From worker 34:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 34:	exiting.
      From worker 35:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 35:	exiting.
      From worker 36:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 36:	exiting.
      From worker 37:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 37:	exiting.
      From worker 38:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 38:	exiting.
      From worker 39:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 39:	exiting.
      From worker 40:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 40:	exiting.
      From worker 41:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 41:	exiting.
      From worker 42:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 42:	exiting.
      From worker 43:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 43:	exiting.
      From worker 49:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 49:	exiting.
      From worker 50:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 50:	exiting.
      From worker 46:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 46:	exiting.
      From worker 51:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 51:	exiting.
      From worker 48:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 48:	exiting.
      From worker 47:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 47:	exiting.
      From worker 52:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 52:	exiting.
      From worker 53:	Master process (id 1) could not connect within 60.0 seconds.
      From worker 53:	exiting.
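
One way to narrow this down, offered only as a sketch: instead of a single `addprocs(128)`, start the local workers in smaller batches and log the running total, so the log shows roughly at which worker count the connections begin to fail. The helper names `batch_sizes` and `addprocs_batched` and the batch size of 16 are hypothetical, chosen for illustration; only `addprocs` itself comes from `Distributed`. Whether batching avoids the `ECONNREFUSED` on this machine is an assumption, not something the log above confirms.

```julia
using Distributed

# Hypothetical helper: split `total` workers into chunks of at most `batch`.
function batch_sizes(total::Integer, batch::Integer)
    total >= 0 || throw(ArgumentError("total must be non-negative"))
    batch > 0 || throw(ArgumentError("batch must be positive"))
    sizes = Int[]
    remaining = total
    while remaining > 0
        n = min(batch, remaining)
        push!(sizes, n)
        remaining -= n
    end
    return sizes
end

# Hypothetical helper: add local workers one batch at a time and
# report how many are up after each batch. Returns the new worker ids.
function addprocs_batched(total::Integer; batch::Integer = 16)
    pids = Int[]
    for n in batch_sizes(total, batch)
        # addprocs(n) launches n local workers and returns their ids
        append!(pids, addprocs(n))
        @info "workers started so far" count = length(pids)
    end
    return pids
end
```

In `langevin.jl` this would replace line 3 with something like `addprocs_batched(128; batch = 16)`. The `60.0 seconds` in the worker messages is the default of `Distributed.worker_timeout()`, which reads the `JULIA_WORKER_TIMEOUT` environment variable, so raising that variable before launching is another thing to try alongside batching.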