It seems that when a worker process is added from a task running on a thread with threadid != 1 (probably), it can’t simply be removed with rmprocs afterwards. For example:
using Distributed
new_proc = fetch(Threads.@spawn addprocs(1)) |> only
rmprocs(new_proc)
results in
ERROR: ConcurrencyViolationError("lock must be held")
Stacktrace:
[1] concurrency_violation()
@ Base ./condition.jl:8
[2] assert_havelock
@ ./condition.jl:25 [inlined]
[3] assert_havelock
@ ./condition.jl:48 [inlined]
[4] assert_havelock
@ ./condition.jl:72 [inlined]
[5] notify(c::Condition, arg::Any, all::Bool, error::Bool)
@ Base ./condition.jl:150
[6] notify (repeats 2 times)
@ ./condition.jl:148 [inlined]
[7] set_worker_state
@ ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:148 [inlined]
[8] _rmprocs(pids::Vector{Int64}, waitfor::Int64)
@ Distributed ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1053
[9] rmprocs(pids::Int64; waitfor::Int64)
@ Distributed ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1037
[10] rmprocs(pids::Int64)
@ Distributed ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1028
[11] top-level scope
@ REPL[7]:1
This isn’t really a question about “adding multiple processes in parallel” (though that’s where it started), but rather: why does the threadid seem to affect removal but not creation? It makes sense that there would be a lock on each worker, but is that lock bound to the thread on which the worker was created?
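To probe that last question without involving Distributed at all, here is a minimal sketch. It assumes the worker's state condition is an ordinary Base `Condition`, which the `notify(c::Condition, ...)` frame in the stack trace suggests but which I have not confirmed in the Distributed source. The names `main_tid`, `creator_tid`, `cond`, and `violated` are just illustrative. It creates a `Condition` inside a spawned task, records which thread that task ran on, and then calls `notify` from the current task:

```julia
# Probe: is a plain `Condition` usable only from the thread that created it?
# Start Julia with more than one thread (e.g. `julia -t 4`), otherwise the
# spawned task necessarily runs on the same thread and nothing can go wrong.

main_tid = Threads.threadid()

# Create the Condition inside a spawned task and record the thread it ran on.
creator_tid, cond = fetch(Threads.@spawn (Threads.threadid(), Condition()))

# Notify from the current task, similar to what `set_worker_state` does
# in the stack trace above.
violated = try
    notify(cond)
    false
catch err
    # Only swallow the error we are probing for; anything else is unexpected.
    err isa ConcurrencyViolationError || rethrow()
    true
end

println("created on thread ", creator_tid,
        ", notified from thread ", main_tid,
        ": violation = ", violated)
```

If the error appears exactly when the two thread IDs differ, that would point at the `Condition` itself remembering its creating thread, rather than at anything specific to how the worker was added.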