Communication error between local processes and remote processes

Hi,
I’m experimenting with a mixed local/remote setup and I’m seeing some strange behavior.

```julia
using Distributed

loc_proc = addprocs(4)                                 # adds processes 2 to 5 on the local machine
rem_proc = addprocs([("foo@remote_machine", :auto)])   # adds processes 6 to 37 on a remote machine
rem_proc_2 = addprocs([("foo@remote_machine", :auto)]) # adds processes 38 to 45 on another remote machine
```

If I run

```julia
r = remotecall(rand, 6, 2, 2)
s = @spawnat 38 1 .+ fetch(r)
fetch(s)
```

everything works as expected: the remote workers are able to communicate with each other.

If I try to mix remote and local processes as in

```julia
r = remotecall(rand, 2, 2, 2)
s = @spawnat 6 1 .+ fetch(r)
fetch(s)
```

I get the following error:

```
From worker 6: ┌ Error: Error on 6 while connecting to peer 2, exiting
From worker 6: │ exception =
From worker 6: │ IOError: connect: connection refused (ECONNREFUSED)
From worker 6: │ Stacktrace:
From worker 6: │ [1] wait_connected(x::Sockets.TCPSocket)
From worker 6: │ @ Sockets /opt/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:532
From worker 6: │ [2] connect
From worker 6: │ @ /opt/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:567 [inlined]
From worker 6: │ [3] connect_to_worker(host::String, port::Int64)
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:632
From worker 6: │ [4] connect_w2w(pid::Int64, config::WorkerConfig)
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:580
From worker 6: │ [5] connect(manager::Distributed.DefaultClusterManager, pid::Int64, config::WorkerConfig)
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:512
From worker 6: │ [6] connect_to_peer(manager::Distributed.DefaultClusterManager, rpid::Int64, wconfig::WorkerConfig)
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:355
From worker 6: │ [7] (::Distributed.var"#121#123"{Int64, WorkerConfig})()
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:342
From worker 6: │ [8] exec_conn_func(w::Distributed.Worker)
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:181
From worker 6: │ [9] (::Distributed.var"#21#24"{Distributed.Worker})()
From worker 6: │ @ Distributed ./task.jl:429
From worker 6: └ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:362
```

I’m confused, since I did not change the default topology (which is `:all_to_all`), so I would expect every process to be able to communicate with every other process.
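For reference, here is a minimal sketch of how to confirm the active topology; `Distributed.PGRP` is an internal of the Distributed stdlib, so treat this as unofficial:

```julia
using Distributed

addprocs(2)

# Internal API (may change between Julia versions): the master's
# ProcessGroup records the cluster topology negotiated at setup.
Distributed.PGRP.topology  # :all_to_all unless addprocs(...; topology=...) was used
```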

Ok, it seems that specifying

```julia
loc_proc = addprocs(4; restrict = false)
```

solves the issue.
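For anyone who lands here, the whole setup from the top of the post, with only this keyword added, then runs cleanly (hostnames unchanged from above):

```julia
using Distributed

# restrict = false makes the local workers listen on all interfaces
# instead of only 127.0.0.1, so remote workers can connect to them.
loc_proc = addprocs(4; restrict = false)
rem_proc = addprocs([("foo@remote_machine", :auto)])
rem_proc_2 = addprocs([("foo@remote_machine", :auto)])

r = remotecall(rand, first(loc_proc), 2, 2)  # runs on a local worker
s = @spawnat first(rem_proc) 1 .+ fetch(r)   # remote worker fetches from the local one
fetch(s)
```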

As stated in the documentation of `addprocs`:

```julia
addprocs(np::Integer; restrict=true, kwargs...)
```

Launches workers using the in-built `LocalManager` which only launches workers on the local host. This can be used to take advantage of multiple cores. `addprocs(4)` will add 4 processes on the local machine. If `restrict` is `true`, binding is restricted to `127.0.0.1`. Keyword args `dir`, `exename`, `exeflags`, `topology`, `lazy` and `enable_threaded_blas` have the same effect as documented for `addprocs(machines)`.
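So the `ECONNREFUSED` makes sense: with `restrict = true` (the default), the local workers only listen on `127.0.0.1`, so when remote worker 6 tries to open its worker-to-worker TCP connection to local worker 2, nothing is listening on a routable address. To verify which address a worker advertises to its peers, this sketch uses Distributed internals (`worker_from_id` and the `bind_addr` field of `WorkerConfig`, both undocumented):

```julia
using Distributed

loc = addprocs(1)  # default restrict = true

# Internal API (undocumented, may change): the address a worker
# advertises for incoming worker-to-worker connections.
w = Distributed.worker_from_id(first(loc))
w.config.bind_addr  # "127.0.0.1" here; a routable address with restrict = false
```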