Communication error between local processes and remote processes

Hi,
I’m experimenting with a mixed local/remote setup and I’m seeing some strange behavior.

```julia
using Distributed

loc_proc = addprocs(4)                                 # adds processes 2 to 5 on the local machine
rem_proc = addprocs([("foo@remote_machine", :auto)])   # adds processes 6 to 37 on a remote machine
rem_proc_2 = addprocs([("foo@remote_machine", :auto)]) # adds processes 38 to 45 on another remote machine
```

If I run

```julia
r = remotecall(rand, 6, 2, 2)
s = @spawnat 38 1 .+ fetch(r)
fetch(s)
```

everything works as expected: the remote workers are able to communicate with each other.

If I try to mix remote and local processes as in

```julia
r = remotecall(rand, 2, 2, 2)
s = @spawnat 6 1 .+ fetch(r)
fetch(s)
```

I get the following error:

```
From worker 6: ┌ Error: Error on 6 while connecting to peer 2, exiting
From worker 6: │ exception =
From worker 6: │ IOError: connect: connection refused (ECONNREFUSED)
From worker 6: │ Stacktrace:
From worker 6: │ [1] wait_connected(x::Sockets.TCPSocket)
From worker 6: │ @ Sockets /opt/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:532
From worker 6: │ [2] connect
From worker 6: │ @ /opt/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:567 [inlined]
From worker 6: │ [3] connect_to_worker(host::String, port::Int64)
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:632
From worker 6: │ [4] connect_w2w(pid::Int64, config::WorkerConfig)
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:580
From worker 6: │ [5] connect(manager::Distributed.DefaultClusterManager, pid::Int64, config::WorkerConfig)
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:512
From worker 6: │ [6] connect_to_peer(manager::Distributed.DefaultClusterManager, rpid::Int64, wconfig::WorkerConfig)
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:355
From worker 6: │ [7] (::Distributed.var"#121#123"{Int64, WorkerConfig})()
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:342
From worker 6: │ [8] exec_conn_func(w::Distributed.Worker)
From worker 6: │ @ Distributed /opt/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:181
From worker 6: │ [9] (::Distributed.var"#21#24"{Distributed.Worker})()
From worker 6: │ @ Distributed ./task.jl:429
From worker 6: └ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:362
```

I’m confused, since I did not change the default topology (which is `:all_to_all`), so I would expect every process to be able to communicate with every other process.
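For reference, here is a minimal sketch of how to confirm the active topology; `Distributed.PGRP` is an internal of the Distributed stdlib, so treat this as unofficial:

```julia
using Distributed

addprocs(2)

# Internal API (may change between Julia versions): the master's
# ProcessGroup records the cluster topology negotiated at setup.
Distributed.PGRP.topology  # :all_to_all unless addprocs(...; topology=...) was used
```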

Ok, it seems that specifying

```julia
loc_proc = addprocs(4; restrict = false)
```

solves the issue.
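For anyone who lands here, the whole setup from the top of the post, with only this keyword added, then runs cleanly (hostnames unchanged from above):

```julia
using Distributed

# restrict = false makes the local workers listen on all interfaces
# instead of only 127.0.0.1, so remote workers can connect to them.
loc_proc = addprocs(4; restrict = false)
rem_proc = addprocs([("foo@remote_machine", :auto)])
rem_proc_2 = addprocs([("foo@remote_machine", :auto)])

r = remotecall(rand, first(loc_proc), 2, 2)  # runs on a local worker
s = @spawnat first(rem_proc) 1 .+ fetch(r)   # remote worker fetches from the local one
fetch(s)
```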

As stated in the documentation of `addprocs`:

```julia
addprocs(np::Integer; restrict=true, kwargs...)
```

Launches workers using the in-built `LocalManager` which only launches workers on the local host. This can be used to take advantage of multiple cores. `addprocs(4)` will add 4 processes on the local machine. If `restrict` is `true`, binding is restricted to `127.0.0.1`. Keyword args `dir`, `exename`, `exeflags`, `topology`, `lazy` and `enable_threaded_blas` have the same effect as documented for `addprocs(machines)`.
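So the `ECONNREFUSED` makes sense: with `restrict = true` (the default), the local workers only listen on `127.0.0.1`, so when remote worker 6 tries to open its worker-to-worker TCP connection to local worker 2, nothing is listening on a routable address. To verify which address a worker advertises to its peers, this sketch uses Distributed internals (`worker_from_id` and the `bind_addr` field of `WorkerConfig`, both undocumented):

```julia
using Distributed

loc = addprocs(1)  # default restrict = true

# Internal API (undocumented, may change): the address a worker
# advertises for incoming worker-to-worker connections.
w = Distributed.worker_from_id(first(loc))
w.config.bind_addr  # "127.0.0.1" here; a routable address with restrict = false
```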