Addprocs crashes if localhost added before remote host

polypus74 · January 13, 2018, 4:00am

May or may not be a bug; it’s my first day using parallel Julia. I do find it odd, however, that if I do an addprocs on localhost first and then on remote:

addprocs(1)
addprocs([("<myotherbox>", 1)])

I get a crash, but if I do the remote first all’s well:

addprocs([("<myotherbox>", 1)])
addprocs(1)

Note that I can make as many addprocs calls to remote as I like, and then follow that with one or more local calls, and all is well, but any subsequent remote calls will always fail. I am running the official build of 0.6.2 on up-to-date Arch Linux. I am calling addprocs manually at the REPL, a single call at a time. I get the following error, the second part only after hitting Ctrl-C:

ERROR: connect: connection refused (ECONNREFUSED)
Stacktrace:
 [1] try_yieldto(::Base.##296#297{Task}, ::Task) at ./event.jl:189
 [2] wait() at ./event.jl:234
 [3] wait(::Condition) at ./event.jl:27
 [4] stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N} where N) at ./stream.jl:42
 [5] wait_connected(::TCPSocket) at ./stream.jl:258
 [6] connect at ./stream.jl:983 [inlined]
 [7] connect_to_worker(::String, ::UInt16) at ./distributed/managers.jl:497
 [8] connect_w2w(::Int64, ::WorkerConfig) at ./distributed/managers.jl:452
 [9] connect(::Base.Distributed.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:386
 [10] connect_to_peer(::Base.Distributed.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./distributed/process_messages.jl:329
 [11] (::Base.Distributed.##117#118{WorkerConfig,Int64})() at ./task.jl:335
Error [connect: connection refused (ECONNREFUSED)] on 3 while connecting to peer 2. Exiting.
Worker 3 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
Stacktrace:
 [1] process_hdr(::TCPSocket, ::Bool) at ./distributed/process_messages.jl:257
 [2] message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:143
 [3] process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:118
 [4] (::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at ./event.jl:73

^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_run_once at /buildworker/worker/package_linux64/build/src/jl_uv.c:132
process_events at ./libuv.jl:82 [inlined]
wait at ./event.jl:216
task_done_hook at ./task.jl:256
unknown function (ip: 0x7f044a9e672b)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
finish_task at /buildworker/worker/package_linux64/build/src/task.c:232
start_task at /buildworker/worker/package_linux64/build/src/task.c:275
unknown function (ip: 0xffffffffffffffff)

Which is all fairly meaningless to me. Just thought I’d bring it up.

StefanKarpinski · January 13, 2018, 4:51am

This definitely seems like a bug. Can you file an issue?

tk3369 · January 13, 2018, 10:39am

From the official Julia 0.6 manual:

LocalManager, used by addprocs(N), by default binds only to the loopback interface. This means that workers started later on remote hosts (or by anyone with malicious intentions) are unable to connect to the cluster. An addprocs(4) followed by an addprocs(["remote_host"]) will fail. Some users may need to create a cluster comprising their local system and a few remote systems. This can be done by explicitly requesting LocalManager to bind to an external network interface via the restrict keyword argument: addprocs(4; restrict=false).
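
Applied to the calls from the first post, the manual's advice might look like the sketch below. The helper name add_local_then_remote is hypothetical and only there for illustration. The version check is an assumption for anyone reading this on Julia 0.7 or later, where addprocs lives in the Distributed standard library; on 0.6 it is already available from Base.

```julia
# On Julia 0.7 and later addprocs comes from the Distributed stdlib;
# on 0.6 (the version in this thread) it is exported from Base.
@static if VERSION >= v"0.7.0-"
    using Distributed
end

# Hypothetical helper (not part of Julia): add local workers first,
# then remote ones, which is the order that failed in the original post.
function add_local_then_remote(nlocal, remote_machines)
    # restrict=false asks LocalManager to bind to an external network
    # interface instead of loopback only, so later remote workers can
    # connect to these local workers.
    local_ids = addprocs(nlocal; restrict=false)

    # Skip the SSH step entirely when no remote machines are given.
    remote_ids = isempty(remote_machines) ? Int[] : addprocs(remote_machines)

    return local_ids, remote_ids
end
```

It would be called as add_local_then_remote(1, [("<myotherbox>", 1)]), with <myotherbox> replaced by the real host name. As the manual passage implies, the trade-off is that the local workers are no longer reachable only from the local machine, so this is best kept to trusted networks.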

polypus74 · January 13, 2018, 3:14pm

That’s what I get for posting before reading to the bottom of the page. Thanks for the info.
