Multithreaded program hangs without explict GC.gc()

I have the following code that uses Julia’s atomic variables and hangs if I do not use gc. I run the program with julia -t 6.

I set two atomic variables to false, then spawn two threads that wait in a while loop for them to become true. I set the variables to true in the main thread. This is done n times (e.g. n = 100,000)

Unless I quit a thread with a timeout (the if statement in f_wait), the program hangs in such a way that Ctrl+C doesn’t return to REPL. Even if I run @async test(100000) , the program does not get back to REPL. The program enters the if a few times in 100000 loops.

Adding GC.gc() in the for loop below fixes the issue. Could someone explain the workings here? Does this mean that when writing similar programs that use threads/atomics/tasks we need to run GC manually? Note, that GC.safepoint() does not fix the issue.

using Base.Threads
using ProgressBars

function f_wait(a, b)
    start = time()
    while !a[]
        # The program hangs without this if statement.
        if time() > start + 10
            @info "Waited for 10s" a[] b[]
            break
        end
    end
    return a[] && b[]
end;

function test(n, x = Atomic{Bool}(false), y = Atomic{Bool}(false))
    for i in ProgressBar(1:n)
        x[] = y[] = false
        t_wx = @spawn f_wait(x, y);
        t_wy = @spawn f_wait(y, x);
        x[] = y[] = true

        wait.([t_wx, t_wy])
        
        # i % 10000 == 0 && GC.gc() # Uncommenting this line fixes the issue.
        # GC.safepoint() # This does not help.
    end
    return true
end

test(100000)

Side note: I stumbled across this while reading an article about memory ordering in C++ and decided to try things in Julia.

I think it may not be a memory issue, but instead a synchronization issue.

Testing this on my machine does freeze up. But if I add yield() to the last line of the while loop in f_wait then it no longer locks.

My guess is that calling GC.gc() forces the running threads to yield() in order to be garbage collected, whereas without GC.gc() being called the while loop simply continues, and old task objects aren’t collected, until they build up and overwhelm the scheduler and cause a lockup?

That’s just a guess at the mechanism, but adding yield() into the while loop seems to prevent a lockup.

2 Likes

I would have thought that at the end of each loop (where GC is now), there are no running tasks anymore because of the wait on the spawned threads. Are finished tasks still affecting the scheduler? Unless, of course, the compiler decides to optimize and skip the wait to the next loop, setting the atomics to false. Surely that shouldn’t happen.

By the way, if I leave only one task (e.g. t_wx), the program does not hang up, somehow it’s the two together that cause a problem.

1 Like

That’s a good point, a finished task should really have no effect on the scheduler. Unless the process of removing a task from the scheduler (after completion) is only triggered by a yield or a GC call?

Hopefully someone with a better understanding of GC or threading in Julia can chime in.