What causes the delay when adding procs

affans · September 16, 2022, 4:04am

I am using ClusterManagers.jl and Distributed.jl to connect to a cluster. I’ve done this many times before and think I understand how it all works, but one thing eludes me: when adding a large number of processors, it seems to “hang” for a good 5 to 10 minutes after the last “connecting to worker 320 out of 320” message.

What can cause this delay?

In my case:

[affans@node004 htop-2.2.0]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9753      defq julia-35   affans  R      13:04     10 node[001-010]

The workers have been up for 13 minutes now, and my main terminal is still hanging at the "connecting to …" part.
