I don't have a minimal working example (MWE) for this. It never fails on my desktop (10 workers), but when it runs on the HPC cluster (a single node with 40 workers) I get really messy errors.
I have plenty of @distributed for loops in other sections of the code (none of them using @spawnat :any), and they all run fine.
Here is the loop that Slurm complains about:
@sync @distributed for i in 1:niter
    @spawnat :any getTheta!(theta,Achn,i,c,J,k,inttype=inttype,fltype=fltype,verbose=verbose)
end
Here is the error:
ERROR: LoadError: TaskFailedException:
On worker 3:
peer 6 didn't connect to 3 within 59.99998092651367 seconds
error at ./error.jl:33
wait_for_conn at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:194
check_worker_state at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:168
send_msg_ at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:176
send_msg at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:134 [inlined]
#remotecall#140 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:365 [inlined]
remotecall at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:364
#remotecall#141 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:376
remotecall at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:376 [inlined]
spawnat at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/macros.jl:15
spawn_somewhere at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/macros.jl:17
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/macros.jl:95 [inlined]
macro expansion at /dfs5/bio/mkarikom/code/DTMwork/dev/DistributedTopicModels/src/chains.jl:132 [inlined]
#140 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/macros.jl:301
#160 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/macros.jl:87
#103 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:290
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:88
#96 at ./task.jl:356
...and 2 more exception(s).
Stacktrace:
[1] sync_end(::Channel{Any}) at ./task.jl:314
[2] (::Distributed.var"#159#161"{DistributedTopicModels.var"#140#142"{DataType,DataType,Bool,SharedArray{Int8,3},SharedArray{Bool,2},Int64,Int64,SharedArray{Float32,3},Channel{Any}},UnitRange{Int64}})() at ./task.jl:333
Stacktrace:
[1] sync_end(::Channel{Any}) at ./task.jl:314
[2] macro expansion at ./task.jl:333 [inlined]
[3] getThetaChain(::SharedArray{Int8,3}, ::SharedArray{Bool,2}, ::Int64; inttype::Type{T} where T, fltype::Type{T} where T, verbose::Bool) at /dfs5/bio/mkarikom/code/DTMwork/dev/DistributedTopicModels/src/chains.jl:131
[4] getThetaChain(::SharedArray{Int8,3}, ::SharedArray{Bool,2}, ::Int64) at /dfs5/bio/mkarikom/code/DTMwork/dev/DistributedTopicModels/src/chains.jl:121
[5] top-level scope at /dfs5/bio/mkarikom/code/DTMwork/pancreatic/slurm/analyze_run_tcga.jl:85
[6] include(::Function, ::Module, ::String) at ./Base.jl:380
[7] include(::Module, ::String) at ./Base.jl:368
[8] exec_options(::Base.JLOptions) at ./client.jl:296
[9] _start() at ./client.jl:506
in expression starting at /dfs5/bio/mkarikom/code/DTMwork/pancreatic/slurm/analyze_run_tcga.jl:85
┌ Warning: Forcibly interrupting busy workers
│ exception = rmprocs: pids [3] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1234
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1030
srun: error: hpc3-14-30: task 0: Exited with exit code 1