Slurm doesn't like @spawnat :any

I don’t really have an MWE for this. It never fails on the desktop (10 workers), but when it runs on HPC (a single node with 40 workers) I get really messy errors.

I have plenty of @distributed for loops in other sections of the code (not using @spawnat :any), and they all run fine.

Here is the line that Slurm complains about:

        @sync @distributed for i in 1:niter
            @spawnat :any getTheta!(theta,Achn,i,c,J,k,inttype=inttype,fltype=fltype,verbose=verbose)
        end

Here is the error:

ERROR: LoadError: TaskFailedException:
On worker 3:
peer 6 didn't connect to 3 within 59.99998092651367 seconds
error at ./error.jl:33
wait_for_conn at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:194
check_worker_state at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:168
send_msg_ at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:176
send_msg at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:134 [inlined]
#remotecall#140 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:365 [inlined]
remotecall at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:364
#remotecall#141 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:376
remotecall at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:376 [inlined]
spawnat at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/macros.jl:15
spawn_somewhere at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/macros.jl:17
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/macros.jl:95 [inlined]
macro expansion at /dfs5/bio/mkarikom/code/DTMwork/dev/DistributedTopicModels/src/chains.jl:132 [inlined]
#140 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/macros.jl:301
#160 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/macros.jl:87
#103 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:290
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:88
#96 at ./task.jl:356

...and 2 more exception(s).

Stacktrace:
 [1] sync_end(::Channel{Any}) at ./task.jl:314
 [2] (::Distributed.var"#159#161"{DistributedTopicModels.var"#140#142"{DataType,DataType,Bool,SharedArray{Int8,3},SharedArray{Bool,2},Int64,Int64,SharedArray{Float32,3},Channel{Any}},UnitRange{Int64}})() at ./task.jl:333
Stacktrace:
 [1] sync_end(::Channel{Any}) at ./task.jl:314
 [2] macro expansion at ./task.jl:333 [inlined]
 [3] getThetaChain(::SharedArray{Int8,3}, ::SharedArray{Bool,2}, ::Int64; inttype::Type{T} where T, fltype::Type{T} where T, verbose::Bool) at /dfs5/bio/mkarikom/code/DTMwork/dev/DistributedTopicModels/src/chains.jl:131
 [4] getThetaChain(::SharedArray{Int8,3}, ::SharedArray{Bool,2}, ::Int64) at /dfs5/bio/mkarikom/code/DTMwork/dev/DistributedTopicModels/src/chains.jl:121
 [5] top-level scope at /dfs5/bio/mkarikom/code/DTMwork/pancreatic/slurm/analyze_run_tcga.jl:85
 [6] include(::Function, ::Module, ::String) at ./Base.jl:380
 [7] include(::Module, ::String) at ./Base.jl:368
 [8] exec_options(::Base.JLOptions) at ./client.jl:296
 [9] _start() at ./client.jl:506
in expression starting at /dfs5/bio/mkarikom/code/DTMwork/pancreatic/slurm/analyze_run_tcga.jl:85
┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [3] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1234
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1030
srun: error: hpc3-14-30: task 0: Exited with exit code 1

The @distributed already spreads the iterations of the loop across the workers. Do you actually also need the @spawnat, which then makes each of those workers try to launch yet another remote task on some other worker?
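
For what it's worth, here is a minimal self-contained sketch (local workers and a dummy loop body, not the getTheta! code above) showing that a plain @sync @distributed loop already places each iteration on a worker process, with no inner @spawnat:

using Distributed
addprocs(4)                        # a few local workers, just for illustration
@everywhere using Distributed      # make myid() available inside the loop body

@sync @distributed for i in 1:8
    # @distributed partitions 1:8 over the workers; no inner @spawnat needed
    println("iteration ", i, " ran on worker ", myid())
end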

Thanks @marius311, so if I just do:

@sync @distributed for i in 1:niter
    getTheta!(theta,Achn,i,c,J,k,inttype=inttype,fltype=fltype,verbose=verbose)
end

Will this then launch getTheta!(...) on any available worker?

Correct.
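
And for completeness, a sketch of the other way around (again self-contained; work here is a stand-in, not your getTheta!): if you ever want @spawnat :any to do the scheduling on its own, drop the @distributed and let @sync, or the returned Futures, handle the waiting:

using Distributed
addprocs(2)                              # a couple of local workers, just for illustration
@everywhere work(i) = (sleep(0.1); i^2)  # stand-in for the real per-iteration kernel

# Mirrors the original loop, but with only one scheduling mechanism.
@sync for i in 1:8
    @spawnat :any work(i)                # @sync waits for all lexically enclosed @spawnat calls
end

# Or keep the Futures if the return values are needed.
futures = Future[]
for i in 1:8
    push!(futures, @spawnat :any work(i))
end
println(fetch.(futures))                 # [1, 4, 9, ..., 64]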
