Timeout issues on Slurm cluster

I’ve been running into IOError: connect: connection timed out (ETIMEDOUT) with about half of my jobs using ClusterManagers on a Slurm cluster. The issue started appearing about a month ago with Julia 1.3 and ClusterManagers 0.4, and I’ve since updated to Julia 1.5.4 to see if that fixes things - it doesn’t. I suspect some recent changes to the cluster caused this and I’ve written to their support, but I have a question for the Julia side as well.

One thing I tried is setting ENV["JULIA_WORKER_TIMEOUT"] = 600.0 at the very start of my simulation script (before loading any packages). The change was visible in Distributed.worker_timeout(), but my jobs still failed in less than 10 minutes (2-4 minutes, according to reports that update every 2 minutes). Why is that the case? Is there another worker timeout variable in effect here?
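
For reference, this is essentially how I set and check it - a trimmed-down sketch, not the full simulation script:

# Set as early as possible, before any packages are loaded.
ENV["JULIA_WORKER_TIMEOUT"] = 600.0

using Distributed

# Reads JULIA_WORKER_TIMEOUT and does report 600.0 here.
@show Distributed.worker_timeout()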

Maybe it’s also important to note that I requeued many of the jobs that failed, with the same setup (no changes in Julia, no changes in the batch file), and they ran just fine. I’ve been using 4-8 nodes with 48 cores x 2 threads each.

Full Error
TaskFailedException:
IOError: connect: connection timed out (ETIMEDOUT)
Stacktrace:
 [1] worker_from_id(::Distributed.ProcessGroup, ::Int64) at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1074
 [2] worker_from_id at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1071 [inlined]
 [3] #remote_do#154 at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:486 [inlined]
 [4] remote_do at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:486 [inlined]
 [5] kill at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/managers.jl:598 [inlined]
 [6] create_worker(::SlurmManager, ::WorkerConfig) at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:585
 [7] setup_launched_worker(::SlurmManager, ::WorkerConfig, ::Array{Int64,1}) at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:526
 [8] (::Distributed.var"#41#44"{SlurmManager,Array{Int64,1},WorkerConfig})() at ./task.jl:356

...and 35 more exception(s).

Stacktrace:
 [1] sync_end(::Channel{Any}) at ./task.jl:314
 [2] macro expansion at ./task.jl:333 [inlined]
 [3] addprocs_locked(::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:topology,),Tuple{Symbol}}}) at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:480
 [4] addprocs(::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:topology,),Tuple{Symbol}}}) at .../julia-1.5.4/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:444
 [5] #addprocs_slurm#14 at .../julia_packages/packages/ClusterManagers/Mq0H0/src/slurm.jl:100 [inlined]
 [6] top-level scope at timing.jl:174
 [7] top-level scope at .../run_sim.jl:18
 [8] include(::Function, ::Module, ::String) at ./Base.jl:380
 [9] include(::Module, ::String) at ./Base.jl:368
 [10] exec_options(::Base.JLOptions) at ./client.jl:296
 [11] _start() at ./client.jl:506

@ffreyer Have you made any progress on this issue? I would look at the Slurm logs, but of course you need to be systems staff to do that. Have they said anything back to you? There may be nothing meaningful in the logs, of course.
One thing - there is a Slurm PAM module which prevents users from sshing into compute nodes; you can only ssh in when you have a job active on that node. Maybe there is a timing issue, with the PAM module not yet allowing access very early in the job?
I would try putting some ‘sleep 20’ statements in your job submission script.
Apologies if I am barking up a wrong tree here.

So far I’ve been asked to use their build of Julia (1.5.2), which I didn’t know existed; it failed just like mine.

I think I tried adding a sleep in my batch file at some point, but if I did, it’s not around anymore. I think I also had one in my startup script at some point. I’ll throw a sleep 60 into both and see if that helps.

On another note, the last two jobs I submitted did actually run, with @time addprocs_slurm(...) taking 78s (7 nodes, 48 cores each) and 94s (5 nodes, 48 cores each), with ~0.1% gc time in either case.

Also, sorry about not providing any code here - I thought I had, but that must have been in a version of the post I discarded. The original version of my sbatch script was just

#!/bin/bash -l
#SBATCH ... (setting time, account, partition, nodes, ntasks-per-node, cpus-per-task)

julia path/to/startup/script.jl <generated_number_of_tasks> 24

My simulations take much longer than one day to run, but the cluster limits jobs to one day, so I have some machinery set up to automatically resubmit jobs. It generates my sbatch script with a fitting number of tasks/processes, meaning <generated_number_of_tasks> is replaced by, for example, 336 (7*48).
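
The generator is nothing fancy - roughly along these lines (a simplified sketch with made-up names; the real script also fills in the remaining #SBATCH options):

# Hypothetical resubmission helper; names and defaults are placeholders.
function submit_job(nodes::Int; cores_per_node::Int = 48, threads::Int = 24)
    ntasks = nodes * cores_per_node
    script = """
    #!/bin/bash -l
    #SBATCH --nodes=$nodes
    #SBATCH --ntasks-per-node=$cores_per_node
    # (time, account, partition, cpus-per-task, ... omitted)

    julia path/to/startup/script.jl $ntasks $threads
    """
    path = "job_$(nodes)nodes.sh"
    write(path, script)
    run(`sbatch $path`)   # hand the generated script over to Slurm
end

submit_job(7)   # e.g. 7 nodes -> 336 tasks, as in the example above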

The original startup script, up to the line that errors, was

@info "Entered script w/ ARGS: $ARGS"

using ClusterManagers, Distributed, LinearAlgebra, Dates

before = nprocs()
addprocs(SlurmManager(parse(Int64, ARGS[1])))
...

I’ve also tried addprocs_slurm and using topology = :master_worker, though I will need all-to-all eventually. I’ve tried repeating addprocs with increasing delays (sleep) between attempts, and I’ve tried adding workers in chunks of 48 (roughly as sketched below). Both of these eventually result in srun: error: tasks ...: Exited with exit code 143 (or 1), and sometimes in segmentation faults.
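
The chunked variant looked roughly like this - a sketch from memory rather than the exact code, assuming ClusterManagers and Distributed are already loaded:

# Add workers in chunks, pausing a little longer after each chunk.
function add_in_chunks(total::Int; chunk::Int = 48, delay = 10)
    added = 0
    while added < total
        n = min(chunk, total - added)
        addprocs_slurm(n; topology = :master_worker)
        added += n
        sleep(delay)   # give Slurm/srun some breathing room
        delay += 10    # wait a bit longer before the next chunk
    end
    return nworkers()
end

add_in_chunks(parse(Int, ARGS[1]))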

I now have one job submitted and running with a sleep 60 in the sbatch script and in the Julia script before addprocs_slurm. The timing I got is 5.75s (5 nodes, 48 cores each), which seems promising. I’ll let some more jobs run and report back when I have a larger sample size.

Apparently Julia is not using the right address/hostname when adding processes. The correct hostname for InfiniBand connections, in my case, is <hostname>i.juwels, but Julia uses <hostname>.juwels.
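
That is, the two names resolve to different interfaces, which is easy to check from within Julia (the node name below is a placeholder):

using Sockets

node = "jwc00n001"                     # hypothetical compute-node name
@show getaddrinfo(node * ".juwels")    # the name Julia currently resolves and uses
@show getaddrinfo(node * "i.juwels")   # the InfiniBand name I actually want it to use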

Now I’m wondering how to add processes via a given local address. I tried something along the lines of addprocs(["xx.xx.xx.xx:9468"]), which resulted in the connection being refused.
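
If I’m reading the Distributed docs correctly, that string form goes through the SSH-based manager, where host:port means the ssh port rather than a worker port (which might explain the refused connection), and a bind address goes into the optional second part of the machine spec. Roughly like this, with made-up addresses, and untested against the PAM restrictions here:

using Distributed

# Machine spec format: "[user@]host[:port] [bind_addr[:port]]".
# The port next to the host is the ssh port; the optional second part is the
# address/port other processes use to reach the worker. Values are made up.
addprocs(["node01.juwels 10.20.30.40"])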