How to debug remote worker not connecting?


I’m using ClusterManagers.jl to create an ElasticManager to connect some workers on a cluster. This works between some nodes on the cluster (manager and workers both on login nodes) but not between others (manager on login node, workers on compute nodes, which is in fact what I need). I’m guessing this points to some sort of networking/firewall problem.

Here’s what I’ve found:

  • I can telnet between compute/login nodes on the same ports I use for the Julia cluster just fine.
  • I can by-hand create a Julia Socket connection between compute/login nodes and send data back-and-forth just fine.
  • Creating the ElasticManager and connecting a remote worker does not work. The sequence is: I run ElasticManager(...) on the login node, then on the compute nodes I run julia -e 'using ClusterManagers; ClusterManagers.elastic_worker(<cookie>,<ip>,<port>)' to connect the workers, but this times out after 60 seconds. I’ve traced it far enough to see that the connection process at least gets as far as calling and returning from the launch function, but I don’t have a good way to insert debug statements into the remaining Julia stdlib code itself (short of a slow-to-iterate Julia recompile).
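
For reference, a minimal sketch of the kind of by-hand socket check described in the second bullet, using the Sockets stdlib. Both ends run in one process on loopback here so that it is self-contained; on the cluster, the listening half would run on one node and the connecting half on the other. The port hint 9009, the address, and the message text are placeholders, not values from the actual setup.

```julia
using Sockets

# Listening side (e.g. the login node). listenany picks the first free port
# at or above the hint; 9009 is an arbitrary placeholder.
port, server = listenany(IPv4(0), 9009)

# Accept one connection and echo back a single line.
@async begin
    sock = accept(server)
    line = readline(sock)
    println(sock, "echo: ", line)
    close(sock)
end

# Connecting side (e.g. a compute node). Replace "127.0.0.1" with the
# listener's address when running across nodes.
client = connect("127.0.0.1", port)
println(client, "ping")
reply = readline(client)
println("got reply: ", reply)

close(client)
close(server)
```

Note that a check like this only exercises one direction of connection setup (the side that calls connect), even though data then flows both ways.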

Any suggestions how I can figure out what’s happening and fix it? Thanks.


Ok, I think I’ve found the cause of the issue. It seems the cluster blocks TCP connections from the login nodes to the compute nodes. I thought Julia only needed the other direction to be allowed, since the worker processes on the compute nodes connect to the login node, but after digging into the code a bit, it seems that after the initial connection the master process then initiates a connection in the other direction, and it’s this that’s failing. Not sure of a good workaround here…
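
A rough sketch of how the blocked direction could be checked directly: a small probe that tries to open a TCP connection and gives up after a timeout, meant to be run from the login node against a port that something on the compute node is listening on (and then the other way around). The helper name `can_connect` and the 5-second timeout are made up for illustration, and the loopback demo at the end only exists so the snippet runs anywhere.

```julia
using Sockets

# Try to open a TCP connection to host:port and report whether it succeeded
# within `timeout` seconds. A firewall that silently drops packets can make
# connect() hang, so the attempt runs in a task and is abandoned on timeout.
function can_connect(host, port; timeout=5.0)
    attempt = @async try
        close(connect(host, port))
        true
    catch
        false
    end
    status = timedwait(() -> istaskdone(attempt), timeout)
    return status === :ok ? fetch(attempt) : false
end

# Loopback demo: probe a port while a listener is up, then again after
# the listener has been closed.
port, server = listenany(ip"127.0.0.1", 9100)
open_result = can_connect("127.0.0.1", port)
close(server)
closed_result = can_connect("127.0.0.1", port)
println((open_result, closed_result))
```

Running the probe in both directions (login to compute, compute to login) would show whether only one direction is filtered.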

(Btw, for anyone wondering, it turns out a good way to insert debug statements into Julia stdlib or Base without recompiling Julia is to just @eval them directly into these modules.)
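
A minimal sketch of that @eval trick, shown on a toy module so it runs on its own. `Demo` and `handshake` are invented stand-ins, not real Julia functions; with a real target the same two-argument form would be used with the stdlib module's name (for example `@eval Distributed ...`), pasting in the original function body plus the extra logging.

```julia
# Stand-in for a stdlib module; `Demo` and `handshake` are made up for
# illustration and are not part of Julia.
module Demo
handshake(host, port) = "connecting to $host:$port"
end

# Redefine the function inside the module, with a debug line added.
# The two-argument form `@eval Module expr` evaluates expr in that module,
# so the new definition replaces the existing method there.
@eval Demo function handshake(host, port)
    @info "handshake called" host port
    return "connecting to $host:$port"
end

# Calls now go through the instrumented version.
result = Demo.handshake("10.0.0.1", 9009)
println(result)
```

The redefinition only lasts for the current session, which makes it quick to iterate on compared with rebuilding Julia.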