I am using ClusterManagers.jl and Distributed.jl to connect to a cluster. I’ve done this many times before and think I understand how it all works, but one thing eludes me: when adding a large number of processors, it seems to “hang” for a good 5 to 10 minutes after the last “connecting to worker 320 out of 320” message.
What can cause this delay?
In my case:
[affans@node004 htop-2.2.0]$ squeue
JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
 9753      defq julia-35 affans  R 13:04    10 node[001-010]
The workers appear to have been up for 13 minutes now, yet my main terminal is still hanging on the "connecting to … " part.