Can't remove process when added on different thread

It seems that when a worker process is added on a thread with (probably) threadid != 1, it can’t be naively removed. For example:

```julia
using Distributed

new_proc = fetch(Threads.@spawn addprocs(1)) |> only
rmprocs(new_proc)
```

results in

```
ERROR: ConcurrencyViolationError("lock must be held")
Stacktrace:
  [1] concurrency_violation()
    @ Base ./condition.jl:8
  [2] assert_havelock
    @ ./condition.jl:25 [inlined]
  [3] assert_havelock
    @ ./condition.jl:48 [inlined]
  [4] assert_havelock
    @ ./condition.jl:72 [inlined]
  [5] notify(c::Condition, arg::Any, all::Bool, error::Bool)
    @ Base ./condition.jl:150
  [6] notify (repeats 2 times)
    @ ./condition.jl:148 [inlined]
  [7] set_worker_state
    @ ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:148 [inlined]
  [8] _rmprocs(pids::Vector{Int64}, waitfor::Int64)
    @ Distributed ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1053
  [9] rmprocs(pids::Int64; waitfor::Int64)
    @ Distributed ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1037
 [10] rmprocs(pids::Int64)
    @ Distributed ~/.julia/juliaup/julia-1.11.5+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/cluster.jl:1028
 [11] top-level scope
    @ REPL[7]:1
```

This isn’t even a question of “adding multiple processes in parallel” (though that’s where it started). Rather: why does it seem like the threadid affects the removal but not the creation? It makes sense that there would be a lock on each worker, but is that lock bound to the thread on which the worker was created?

Unfortunately, Distributed was never thread-safe 🙃. That has been fixed in Julia 1.12, and indeed your MWE works for me in the 1.12 beta.
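
For anyone stuck on 1.11 or earlier, one possible workaround is to keep `addprocs` and `rmprocs` on the same thread. This is only a sketch and has not been checked across versions. Judging by the `assert_havelock` frames in the stack trace, the `Condition` that `rmprocs` notifies appears to be tied to the thread on which the worker was created. The assumption here is that a sticky `@async` task, which runs on the same thread as the task that started it, avoids that mismatch:

```julia
using Distributed

# Assumption: this runs at the REPL or at the top level of a script, so the
# current task, the @async task, and the later rmprocs call share one thread.
# Unlike Threads.@spawn, @async creates a sticky task that stays on the
# thread of the task that started it.
new_proc = fetch(@async addprocs(1)) |> only

# Remove the worker from the same thread that added it.
rmprocs(new_proc)
```

On 1.12, per the above, the original `Threads.@spawn` version should work unchanged.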

Huh, TIL. 1.12 is gonna be sick!