I have the following code that uses a Channel to manage Tasks created with Threads.@spawn. The idea is to run a method (called simulate()) in parallel on multiple threads, wait up to some timeout for the tasks to finish, and surface any errors encountered.
function some_method()
    # ...
    println("Before sim start")
    timeout_us = CPUtime_us() + max_time * 1e6
    sim_channel = Channel{Task}(min(1000, n_iterations)) do channel
        for n in 1:n_iterations
            put!(channel, Threads.@spawn simulate(...))
            # println("Put ", n, ", ", length(channel.data))
        end
        println("Channel task finished ", channel)
    end
    for sim_task in sim_channel
        CPUtime_us() > timeout_us && break
        try
            fetch(sim_task) # Throws a TaskFailedException if failed.
        catch err
            throw(err.task.exception) # Throw the underlying exception.
        end
    end
    # ...
end
This code runs fine in the REPL (i.e., julia -t 8 and julia> include("script.jl")) with 7 threads but hangs with 8. By hanging I mean that Before sim start gets printed, at which point the REPL hangs. The fact that it hangs with 8 threads but not 7 in the REPL is already strange enough, but what’s even stranger is that if I uncomment the println statement after put! on the channel, it doesn’t hang even with 8 threads. So my suspicion at that point was that perhaps some random initialization of threads is affected by the code file, causing things to hang with 8 threads, as strange as that might sound.
I would really appreciate any help with this issue, and I am curious to hear what the general strategy is for debugging this kind of issue in Julia.
This is a little odd. Though note that println() can yield control to the scheduler so enabling printing on that Task gives other tasks a chance to run. For example, simulation tasks might then interleave with your task which is adding to sim_channel. So if the bug you see is somehow dependent on scheduling order, it’s not entirely surprising that adding print debugging changes the behavior.
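To make the scheduling point concrete, here is a minimal sketch that is unrelated to the original simulation code. The names events and worker are made up for illustration, and it uses an explicit yield() where println might yield implicitly while waiting on I/O. A task created with @async is only scheduled, so it cannot run until the current task gives up control:

```julia
# Hypothetical demo (not the original simulation code): a scheduled task only
# gets to run once the current task yields to the scheduler.
events = String[]

# @async schedules the task on the current thread but does not switch to it.
worker = @async push!(events, "worker ran")

push!(events, "main: before yield")
yield()  # explicit yield; println can do this implicitly while waiting on I/O
push!(events, "main: after yield")

wait(worker)  # make sure the worker has finished before inspecting events
foreach(println, events)
```

The first entry is always the one pushed before yield(). Where the worker’s entry lands relative to the last one depends on when the scheduler runs it, which is exactly the kind of ordering that a debugging println can change.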
I guess you could try a low-level call to printf in order to get some output without disturbing the scheduling order:
@ccall printf("Put %d\n"::Cstring; n::Cint)::Cint
In low-level concurrent Julia programming it’s easy to get hangs when the channel capacity is wrong for the ordering of take! and put!, especially if a task crashes and a close(channel) is forgotten. I can’t see any of the more obvious issues here though. A self-contained example would be best if you can make one.
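As a rough sketch of the close(channel) point (produce_squares and its arguments are hypothetical, not taken from the code above): if the producer closes the channel in a finally block, a consumer iterating over the channel terminates even when the producer throws, instead of blocking forever on an open, empty channel.

```julia
# Hypothetical producer: puts i^2 for i in 1:n into a small buffered channel.
function produce_squares(n)
    ch = Channel{Int}(2)  # capacity 2, so put! blocks until the consumer catches up
    producer = Threads.@spawn begin
        try
            for i in 1:n
                put!(ch, i^2)
            end
        finally
            # Always close, even on error; otherwise the consumer can hang.
            close(ch)
        end
    end
    return ch, producer
end

ch, producer = produce_squares(5)
results = collect(ch)  # iteration ends once the channel is closed and drained
wait(producer)         # rethrows as TaskFailedException if the producer failed
println(results)
```

The do-block form of the Channel constructor used in the question binds the channel to its task, so it should be closed automatically when that task ends. The explicit try/finally here only matters when the channel and the task are created separately.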
Which architecture and Julia version are you using? People reported similar behaviour (7 vs 8 threads) on the new MacBooks with M1 Pro/Max here (both native and Rosetta).
Thanks for the suggestion. I tried to create a “minimal” example that would reproduce the issue but ended up with a not-so-minimal one (though it still reproduces the issue).
Please let me know if you can reproduce the issue on your machine. @hexaeder mentions below that similar issues have been observed on M1, and I was indeed using an M1 machine. Curious to see if this has to do with hardware compatibility.
Thanks for the suggestion. Please see my response to @sijo for reproducing the issue (apologies for it being not so “minimal”).
I also tried your lower-level printf in place of println with 8 threads, and I saw n_iterations (50 in my run) Put lines printed, but not the output of the println statement after the for loop. I guess this means it did create all the Tasks it was supposed to but then somehow failed to exit the for loop, which is more revealing but equally perplexing.
Thanks for the pointer. I was indeed running things on an M1 machine with Julia 1.6.3. I’ll switch to a Linux machine and see if I can reproduce the issue there.