(Running Julia 1.6.1, if that matters.)
I'm seeking help in understanding what I am doing wrong in an application that combines multiple worker processes with multiple threads within each one. I’ve been pushing and pulling at this for a while; any advice would be appreciated.
Combining worker processes and threads does not seem to be well supported by Julia (that is, compared to the excellent support for using either multiple worker processes or multiple threads on their own).
The package I’m writing attempts to provide a convenient API for this combination, the notion being that one runs a coherent application across multiple servers in a compute cluster, each with multiple CPUs.
At any rate, things seem to be working reasonably well, except that I get really strange errors in some tests. Specifically, if I repeatedly run only the dt_foreach test (a test which runs a loop across all threads of all worker processes), I see one of the following:
- The test passes (most often)
- Deadlocks (less common, but not that rare)
- An error saying I’m taking from a closed channel (which “can’t happen”). Pretty rare, but happens.
- An error converting an object from one type to another (which also “can’t happen”). Also pretty rare, but happens.
- GC crashing on “task switch within GC” (which really “can’t happen”). I only saw this once.
The detailed crash traces are available in “Errors running dt_foreach test in Snarl.jl commit 8d82051” on GitHub.
The code is in the GitHub repository orenbenkiki/Snarl.jl (“An opinionated framework for parallel processing”); the latest commit as of the time I’m writing this is 8d82051e7fac06dbf3e153fac64d81b59a68ef2d.
I’m assuming I’m doing something very wrong, but I can’t figure out what. Seeing the GC crash got me suspecting I might be pushing Julia beyond what it is capable of - perhaps even a bug in the standard libraries/compiler - but of course it is still more likely the fault is mine.
The code that breaks is a small part of the system, actually just a part of the testing code. I’m spawning a service thread that runs “forever” on the main process and listens on a channel which is known everywhere. Every other thread (on either the main process or a worker process) can send a request on this channel for a unique value of a counter. The request is encoded as a tuple of (counter-identifier, response-channel). The service thread increments the designated counter (there’s an array of them), sends the resulting unique value on the response channel, then closes it (closes the response channel, not the global channel, of course). That is, each response channel is only used to pass one message, possibly between threads in different processes.
Here is the relevant code fragment:
used_counters = zeros(Int, COUNTERS)

function make_counters_channel()::Channel{Tuple{Int,RemoteChannel{Channel{Int}}}}
    return Channel{Tuple{Int,RemoteChannel{Channel{Int}}}}(nprocs() * nthreads())
end

@send_everywhere counters_channel RemoteChannel(make_counters_channel)

function serve_counters()::Nothing
    while true
        request = take!(counters_channel)
        counter = request[1]
        response_channel = request[2]
        used_counters[counter] += 1
        put!(response_channel, used_counters[counter])
        close(response_channel)
    end
    error("Never happens")  # untested
end

remote_do(serve_counters, 1)

@everywhere function next!(counter::Int)::Int
    response_channel = RemoteChannel(() -> Channel{Int}(1))
    put!(counters_channel, (counter, response_channel))
    sleep(0)        # Doesn't solve the problem.
    atomic_fence()  # Doesn't solve the problem.
    yield()         # Doesn't solve the problem.
    # Problems manifested here:
    # - Deadlock
    # - GC error
    # - Taking from a closed channel.
    return take!(response_channel)
end
Where @send_everywhere is a macro I provide for setting a global variable “@everywhere”.
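To help narrow things down, the same request/response protocol can be sketched with plain Channels inside a single process, with no RemoteChannel or worker processes involved. This is only an illustrative sketch, not code from Snarl.jl: the names LOCAL_COUNTERS, local_used, local_requests, serve_local, next_local! and run_local_test are made up for the example. If a version like this behaves reliably under many threads while the distributed one does not, that would suggest looking at the RemoteChannel layer rather than at the protocol itself.

```julia
using Base.Threads

const LOCAL_COUNTERS = 4   # hypothetical number of counters
const local_used = zeros(Int, LOCAL_COUNTERS)
# Each request is (counter index, one-shot response channel).
const local_requests = Channel{Tuple{Int,Channel{Int}}}(nthreads())

# Service task: the only code that reads or writes `local_used`.
# The loop ends once `local_requests` is closed and drained.
function serve_local()::Nothing
    for (counter, response) in local_requests
        local_used[counter] += 1
        put!(response, local_used[counter])
        close(response)  # the buffered value can still be taken after close
    end
    return nothing
end

# Client side: ask the service task for the next value of a counter.
function next_local!(counter::Int)::Int
    response = Channel{Int}(1)
    put!(local_requests, (counter, response))
    return take!(response)
end

# Issue `n` concurrent requests for counter 1 and collect the answers.
# Closes `local_requests` when done, so it is meant to be called once.
function run_local_test(n::Int)::Vector{Int}
    server = Threads.@spawn serve_local()
    results = Vector{Int}(undef, n)
    @sync for i in 1:n
        Threads.@spawn results[i] = next_local!(1)
    end
    close(local_requests)
    wait(server)
    return results
end

results = run_local_test(100)
# Every request should have received a distinct value.
@assert sort(results) == collect(1:100)
```

Running this with several threads (for example `julia -t 4`) exercises the same put!/take!/close sequence as the distributed code, but only through local Channels.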
Am I doing something very wrong here, or did I hit on a bug?
Any help would be appreciated.