I’m using ClusterManagers.jl to create an ElasticManager to connect some workers on a cluster. This works between some nodes on the cluster (manager and workers both on login nodes) but not between others (manager on a login node, workers on compute nodes, which is in fact what I need). I’m guessing this points to some sort of networking/firewall problem.
Here’s what I’ve found:
- I can `telnet` between compute/login nodes on the same ports I use for the Julia cluster just fine.
- I can create a Julia `Socket` connection by hand between compute/login nodes and send data back and forth just fine.
- Creating the ElasticManager and connecting a remote worker does not work. I run `ElasticManager(...)` on the login node, then from the compute nodes I run `julia -e 'using ClusterManagers; ClusterManagers.elastic_worker(<cookie>,<ip>,<port>)'` to connect the workers, but this times out after 60 seconds. I’ve traced it far enough to see that the connection process at least gets as far as calling and returning from the `launch` function at https://github.com/JuliaLang/julia/blob/v1.1.0/stdlib/Distributed/src/cluster.jl#L399, but I don’t have a good way to insert debug statements in the remaining Julia stdlib code itself (short of a slow-to-iterate Julia recompile).
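
For reference, a by-hand socket check of the kind mentioned in the second bullet can be sketched roughly as follows. This is a minimal sketch and not the exact commands that were run: the helper name `roundtrip` and the port number are placeholders, and both ends run in one process so the snippet is self-contained. On the cluster, the listening half would run on the login node and the connecting half on a compute node.

```julia
using Sockets

# Hypothetical helper (not part of ClusterManagers or Distributed):
# start a one-shot echo listener on `port`, connect to it at `host`,
# send `msg`, and return the reply line.
function roundtrip(host::IPAddr, port::Integer, msg::AbstractString)
    # "Manager" side: bind on all interfaces, as a login node would.
    server = listen(IPv4(0), port)
    @async begin
        sock = accept(server)                    # wait for one connection
        println(sock, "echo: ", readline(sock))  # echo one line back
        close(sock)
    end

    # "Worker" side: connect, send one line, read the reply.
    client = connect(host, port)
    println(client, msg)
    reply = readline(client)

    close(client)
    close(server)
    return reply
end
```

Calling `roundtrip(ip"127.0.0.1", 9009, "hello")` exercises both halves locally. If the two halves succeed across nodes on the same port the ElasticManager uses, plain TCP reachability in that direction is probably not the issue.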
Any suggestions how I can figure out what’s happening and fix it? Thanks.