Channel hangs with 8 threads but not 7 in REPL

Hello!

I have the following code that uses Channel to manage Tasks created with Threads.@spawn. The idea is to be able to run a method (called simulate()) in parallel using multiple threads, wait for some timeout for the tasks to finish, and to surface any errors encountered.

function some_method()
    # ...
    println("Before sim start")
    timeout_us = CPUtime_us() + max_time * 1e6
    sim_channel = Channel{Task}(min(1000, n_iterations)) do channel
        for n in 1:n_iterations
            put!(channel, Threads.@spawn simulate(...))
            # println("Put ", n, ", ", length(channel.data))
        end
        println("Channel task finished ", channel)
    end

    for sim_task in sim_channel
        CPUtime_us() > timeout_us && break
        try
            fetch(sim_task)  # Throws a TaskFailedException if failed.
        catch err
            throw(err.task.exception)  # Throw the underlying exception.
        end
    end
    # ...
end

This code runs fine in the REPL with 7 threads but hangs with 8 (started with e.g. julia -t 8, followed by julia> include("script.jl")). By hanging I mean that Before sim start gets printed, after which the REPL stops responding. That it hangs with 8 threads but not 7 is strange enough, but what's even stranger is that if I uncomment the print statement after put! on the channel, it doesn't hang even with 8 threads. So my suspicion at this point is that some random initialization of threads is affected by the code file and somehow causes the hang with 8 threads, as strange as that might sound.

I would really appreciate any help with the issue, and I'm curious to hear what the general strategy is for debugging this kind of issue in Julia.

Thanks!

Could you provide a complete minimal example to reproduce the problem?

(By the way, to answer your methodology question: making a minimal example would be my first step in investigating the issue :slight_smile: )
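As a rough idea of what a self-contained reproducer could look like, here is a sketch of the pattern from the original post. Everything not in the post is an assumption: simulate is a hypothetical stand-in that only burns CPU time, run_simulations is a made-up wrapper name, and the deadline uses wall-clock time() instead of CPUtime_us() so that no extra package is needed. It is not known whether this reduced version still triggers the hang.

```julia
# Hypothetical stand-in for the real simulate(); it only burns some CPU time.
simulate(n) = sum(sqrt(i) for i in 1:n)

function run_simulations(n_iterations; max_time = 5.0)
    deadline = time() + max_time  # wall-clock deadline in seconds (assumption)

    # Producer: spawns one task per iteration and puts it on the channel.
    # The do-block form closes the channel once the producer finishes.
    sim_channel = Channel{Task}(min(1000, n_iterations)) do channel
        for n in 1:n_iterations
            put!(channel, Threads.@spawn simulate(100_000))
        end
    end

    # Consumer: fetches each task until the channel is drained or time is up.
    n_done = 0
    for sim_task in sim_channel
        time() > deadline && break
        fetch(sim_task)  # Throws a TaskFailedException if the task failed.
        n_done += 1
    end
    return n_done
end

println("threads = ", Threads.nthreads(), ", completed = ", run_simulations(50))
```

Running this with julia -t 7 and then julia -t 8 would show whether the thread count alone is enough to trigger the problem.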

This is a little odd. Though note that println() can yield control to the scheduler so enabling printing on that Task gives other tasks a chance to run. For example, simulation tasks might then interleave with your task which is adding to sim_channel. So if the bug you see is somehow dependent on scheduling order, it’s not entirely surprising that adding print debugging changes the behavior.

I guess you could try a low level call to printf in order to get some output without disturbing the scheduling order:

@ccall printf("Put %d\n"::Cstring; n::Cint)::Cint

In low-level concurrent Julia programming it's easy to get hangs when the channel capacity is wrong for the ordering of take! and put!, especially if a task crashes and a close(channel) is forgotten. I can't see any of the more obvious issues here though. A self-contained example would be best if you can make one.
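To illustrate the forgotten close(channel) case, here is a small sketch that is not taken from the code in question. The names ch, producer and results are made up for the example. The producer closes the channel in a finally block, so the consumer loop ends even if the producer throws partway through; without the close, the for loop would wait forever for more items.

```julia
ch = Channel{Int}(2)  # small buffer: put! blocks whenever the buffer is full

producer = Threads.@spawn begin
    try
        for i in 1:5
            put!(ch, i)
        end
    finally
        # Without this close, the consumer loop below would never terminate.
        close(ch)
    end
end

results = Int[]
for x in ch  # iteration stops once the channel is closed and drained
    push!(results, x)
end
wait(producer)
```

The Channel(...) do ... end form used in the original post binds the channel to the producer task, so the channel is closed automatically when that task ends; the explicit try/finally is only needed when the channel and the task are created separately.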

What architecture and Julia Version do you use? People reported similar behaviour (7 vs 8 threads) on the new Macbooks with M1 Pro/Max here (both native and rosetta).

Thanks for the suggestion. I tried to create a “minimal” example that would reproduce the issue but ended up with a not-so-minimal one (which still reproduces it) :slightly_smiling_face:

First of all, this is the script to run: MCTS.jl/dev.jl at dev · kykim0/MCTS.jl · GitHub. To run this, you need to install the MCTS.jl package (see GitHub - JuliaPOMDP/MCTS.jl: Monte Carlo Tree Search for Markov decision processes using the POMDPs.jl framework), and replace src/vanilla.jl with the version in my repo which is MCTS.jl/vanilla.jl at dev · kykim0/MCTS.jl · GitHub. Then do e.g., julia --project -t 8 test/dev.jl.

Please let me know if you can reproduce the issue on your machine. @hexaeder mentions below that similar issues have been observed on M1, and I was indeed using an M1 machine. Curious to see if this has to do with hardware compatibility.

Thanks for the suggestion. Please see my response to @sudete for reproducing the issue (apologies for it being not so “minimal”).

I also tried your lower-level printf in place of println with 8 threads, and I saw as many put statements as n_iterations (50 in my run), but not the println statement after the for loop. I guess this means it did create all the Tasks it was supposed to but then somehow failed to get past the end of the for loop, which is more revealing but equally perplexing.

Thanks for the pointer. I was indeed running things on an M1 machine with Julia 1.6.3. I'll switch to a Linux machine and see if I can reproduce the issue there.