Multithreaded program hangs without explict GC.gc()

I have the following code that uses Julia’s atomic variables and hangs if I do not use gc. I run the program with julia -t 6.

I set two atomic variables to false, then spawn two threads that wait in a while loop for them to become true. I set the variables to true in the main thread. This is done n times (e.g. n = 100,000)

Unless I quit a thread with a timeout (the if statement in f_wait), the program hangs in such a way that Ctrl+C doesn’t return to REPL. Even if I run @async test(100000) , the program does not get back to REPL. The program enters the if a few times in 100000 loops.

Adding GC.gc() in the for loop below fixes the issue. Could someone explain the workings here? Does this mean that when writing similar programs that use threads/atomics/tasks we need to run GC manually? Note, that GC.safepoint() does not fix the issue.

using Base.Threads
using ProgressBars

function f_wait(a, b)
    start = time()
    while !a[]
        # The program hangs without this if statement.
        if time() > start + 10
            @info "Waited for 10s" a[] b[]
            break
        end
    end
    return a[] && b[]
end;

function test(n, x = Atomic{Bool}(false), y = Atomic{Bool}(false))
    for i in ProgressBar(1:n)
        x[] = y[] = false
        t_wx = @spawn f_wait(x, y);
        t_wy = @spawn f_wait(y, x);
        x[] = y[] = true

        wait.([t_wx, t_wy])
        
        # i % 10000 == 0 && GC.gc() # Uncommenting this line fixes the issue.
        # GC.safepoint() # This does not help.
    end
    return true
end

test(100000)

Side note: I stumbled across this while reading an article about memory ordering in C++ and decided to try things in Julia.

I think it may not be a memory issue, but instead a synchronization issue.

Testing this on my machine does freeze up. But if I add yield() to the last line of the while loop in f_wait then it no longer locks.

My guess is that calling GC.gc() forces the running threads to yield() in order to be garbage collected, whereas without GC.gc() being called the while loop simply continues, and old task objects aren’t collected, until they build up and overwhelm the scheduler and cause a lockup?

That’s just a guess at the mechanism, but adding yield() into the while loop seems to prevent a lockup.

3 Likes

I would have thought that at the end of each loop (where GC is now), there are no running tasks anymore because of the wait on the spawned threads. Are finished tasks still affecting the scheduler? Unless, of course, the compiler decides to optimize and skip the wait to the next loop, setting the atomics to false. Surely that shouldn’t happen.

By the way, if I leave only one task (e.g. t_wx), the program does not hang up, somehow it’s the two together that cause a problem.

1 Like

That’s a good point, a finished task should really have no effect on the scheduler. Unless the process of removing a task from the scheduler (after completion) is only triggered by a yield or a GC call?

Hopefully someone with a better understanding of GC or threading in Julia can chime in.

Coming in late here, but I think this is a textbook case of what the docs for GC.safepoint are talking about. The solution is to insert a safepoint within the loop in f_wait (not within test), as follows:

using Base.Threads
using ProgressBars

function f_wait(a, b)
    while !a[]
        GC.safepoint()
    end
    return a[] && b[]
end;

function test(n, x = Atomic{Bool}(false), y = Atomic{Bool}(false))
    for i in ProgressBar(1:n)
        x[] = y[] = false
        t_wx = @spawn f_wait(x, y);
        t_wy = @spawn f_wait(y, x);
        x[] = y[] = true

        wait.([t_wx, t_wy])
    end
    return true
end

test(100000)

In more detail: The problem is that f_wait has a potentially infinite loop with no allocations, IO, or task switches, hence no implicit GC safepoints, thus blocking GC for a potentially infinite time. Meanwhile, your test function performs an allocation when it creates the t_wy task, and this happens after t_wx has been scheduled, but before its termination condition x[] is set to true. Whenever this particular allocation triggers a GC run, you have a deadlock—the main thread is waiting for every other thread to reach a safepoint so the GC can do its sweep, while t_wx is waiting for the main thread to set x[] to true, never encountering a safepoint during the wait. The solution is to introduce a safepoint explicitly as shown.

2 Likes

Thanks! It’s been a while and coming back to it, your explanation makes complete sense and with it also the text in GC.safepoint doc.

My confusion with the safepoint was that if a thread calls GC.safepoint() how is it supposed to know that it is safe, e.g., other threads can allocate at that point. Am I right that the answer to this is that all threads have to reach a safepoint for GC to do a sweep, not just one? And at least in julia up to 1.9, GC is single-threaded anyway.

2 Likes

With the caveat that this may be an abuse of terminology (I’m just inferring from documentation and experience, I don’t have any relevant credentials), yes, this is correct as far as I understand. Whenever a task needs to allocate or hits some kind of async/yield point (IO, switching tasks), it will check in with the GC and ask if it wants to do any work. This is what I called an implicit safepoint above. If the GC says yes, it first waits for every other running task to also check in before actually doing the sweep. That’s why code that’s entirely free of allocations and yield points can block GC indefinitely and you can get deadlocks as observed here unless you manually insert a safepoint.