Addprocs crashes if localhost added before remote host

bug

#1

May or may not be a bug; it’s my first day using parallel Julia. I do find it odd, however, that if I call addprocs on localhost first and then on a remote host:

addprocs(1)
addprocs([("<myotherbox>", 1)])

I get a crash, but if I add the remote first, all’s well:

addprocs([("<myotherbox>", 1)])
addprocs(1)

Note that I can make as many addprocs calls to the remote host as I like, then follow them with one or more local calls, and all is well, but any remote call made after a local one always fails. I am running the official build of 0.6.2 on up-to-date Arch Linux. I am calling addprocs manually at the REPL, one call at a time. I get the following error; the second part appears only after hitting Ctrl-C:

ERROR: connect: connection refused (ECONNREFUSED)
Stacktrace:
 [1] try_yieldto(::Base.##296#297{Task}, ::Task) at ./event.jl:189
 [2] wait() at ./event.jl:234
 [3] wait(::Condition) at ./event.jl:27
 [4] stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N} where N) at ./stream.jl:42
 [5] wait_connected(::TCPSocket) at ./stream.jl:258
 [6] connect at ./stream.jl:983 [inlined]
 [7] connect_to_worker(::String, ::UInt16) at ./distributed/managers.jl:497
 [8] connect_w2w(::Int64, ::WorkerConfig) at ./distributed/managers.jl:452
 [9] connect(::Base.Distributed.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:386
 [10] connect_to_peer(::Base.Distributed.DefaultClusterManager, ::Int64, ::WorkerConfig) at ./distributed/process_messages.jl:329
 [11] (::Base.Distributed.##117#118{WorkerConfig,Int64})() at ./task.jl:335
Error [connect: connection refused (ECONNREFUSED)] on 3 while connecting to peer 2. Exiting.
Worker 3 terminated.
ERROR (unhandled task failure): Version read failed. Connection closed by peer.
Stacktrace:
 [1] process_hdr(::TCPSocket, ::Bool) at ./distributed/process_messages.jl:257
 [2] message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:143
 [3] process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:118
 [4] (::Base.Distributed.##99#100{TCPSocket,TCPSocket,Bool})() at ./event.jl:73
^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_run_once at /buildworker/worker/package_linux64/build/src/jl_uv.c:132
process_events at ./libuv.jl:82 [inlined]
wait at ./event.jl:216
task_done_hook at ./task.jl:256
unknown function (ip: 0x7f044a9e672b)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
finish_task at /buildworker/worker/package_linux64/build/src/task.c:232
start_task at /buildworker/worker/package_linux64/build/src/task.c:275
unknown function (ip: 0xffffffffffffffff)

Which is all fairly meaningless to me. Just thought I’d bring it up.


#2

This definitely seems like a bug. Can you file an issue?


#3

From the official Julia 0.6 manual:

LocalManager, used by addprocs(N), by default binds only to the loopback interface. This means that workers started later on remote hosts (or by anyone with malicious intentions) are unable to connect to the cluster. An addprocs(4) followed by an addprocs(["remote_host"]) will fail. Some users may need to create a cluster comprising their local system and a few remote systems. This can be done by explicitly requesting LocalManager to bind to an external network interface via the restrict keyword argument: addprocs(4; restrict=false).
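
Applied to the sequence from the first post, that would look roughly like the sketch below. It assumes the same setup as the original report (Julia 0.6.2 at the REPL); `<myotherbox>` is the placeholder hostname from that post, not a real host, so substitute your own. I have not run this against that machine.

```julia
# Julia 0.6: addprocs is available from Base.
# On Julia 0.7 and later, run `using Distributed` first.

# Start the local worker with restrict=false so that LocalManager binds to
# an external network interface instead of only the loopback interface.
addprocs(1; restrict=false)

# Remote workers added afterwards should now be able to connect to the
# local worker. Replace "<myotherbox>" with the actual remote hostname.
addprocs([("<myotherbox>", 1)])
```

Keep in mind the trade-off the manual points out: with restrict=false the local workers are no longer reachable only from the loopback interface, so this is best used on a network you trust.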


#4

That’s what I get for posting before reading to the bottom of the page :blush: Thanks for the info.