Hanging in ijl_gc_collect() when using adopted threads

peremato · May 22, 2023, 1:48pm

When using the MT mode of Geant4.jl, the main thread hangs in an infinite loop in ijl_gc_collect(). I am using the latest 1.9 release of Julia on macOS.
The way it works is as follows:

The main thread (Julia REPL) calls C++ code that creates a number of worker threads. These call back into Julia and are all adopted into the Julia thread pool (I call jl_adopt_thread() explicitly, but that is probably not necessary since the callback goes through cfunction). So far, so good. Work is performed as expected.
After the run is completed, the adopted worker threads are put on wait (std::cv::wait(...)) and control is returned to the main thread (REPL). I do call jl_yield() at the end of the run and before entering the wait. The threads are waiting for a new run to eventually be started again.
At this moment, in an unpredictable manner, typically when generating some output, the REPL hangs in an infinite loop in ijl_gc_collect() between addresses +256 and +264. My guess is that it is looping around:

jl_gc_wait_for_the_world(gc_all_tls_states, gc_n_threads);
JL_PROBE_GC_STOP_THE_WORLD();

Is there anything I can do to avoid this hang? Many thanks in advance.

peremato · May 30, 2023, 12:11pm

@maleadt do you think the problem I am having could be related to, or fixed by, https://github.com/JuliaLang/julia/pull/49934? I am not really equipped to understand the internals of Julia threading and GC.

maleadt · May 30, 2023, 6:16pm

No, that would manifest as a segfault when adopting. What might be happening here (as per my limited understanding of that part of Julia) is that when a thread starts GC, it waits for all other Julia threads to reach a safepoint. Your newly adopted threads however are not at a safepoint, yet they are blocked in std::cv::wait, causing other threads to hang when attempting to enter GC. You probably want to enter a GC safe region during that wait (by calling jl_gc_safe_enter), so that GC can run during it.
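The waiting behaviour described above can be illustrated with a toy model. This is only a sketch of the idea, not Julia's runtime code: `ToyThread`, `GcState` and `wait_for_the_world` are hypothetical names, and the timeout exists only so the demonstration terminates (the hang reported in this thread suggests the real wait has no such escape).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Toy model only. None of these names are part of Julia's API.
enum class GcState : int { Unsafe = 0, Safe = 1 };

// One entry per thread known to the (toy) runtime.
struct ToyThread {
    std::atomic<GcState> state{GcState::Unsafe};
};

// Collector side: spin until every listed thread reports a safe state.
// Returns false if the deadline passes first. The timeout is an addition
// for this demo so that a stuck thread does not hang the example.
bool wait_for_the_world(const std::vector<ToyThread*>& threads,
                        std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (ToyThread* t : threads) {
        while (t->state.load(std::memory_order_acquire) != GcState::Safe) {
            if (std::chrono::steady_clock::now() >= deadline) {
                return false;  // this thread never became safe
            }
            std::this_thread::yield();
        }
    }
    return true;
}
```

A thread that blocks in a condition-variable wait while still marked `Unsafe` never changes its state, so the collector keeps spinning. Once the thread marks itself `Safe` before blocking, the same call returns immediately.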

peremato · May 31, 2023, 7:04am

Thanks very much for your hint. I added jl_gc_safe_enter and it did work (I didn’t get an infinite loop). Then I removed it to cross-check that this was indeed the solution, but now I have no infinite loops anymore. Something must have changed in the way I build the C++ wrapper so that it now works. Very strange.

jling · May 31, 2023, 1:58pm

Is the current status of this “magically solved” (which could just mean repro is hard)?

peremato · May 31, 2023, 3:57pm

Yes. After more tests I encountered the problem again. It has to do with the unpredictability of when a GC occurs. To minimize the chance of a hang, by the time a GC happens all the waiting adopted threads must have called jl_gc_safe_enter, just before entering the wait state. In addition, I am also disabling GC while the adopted threads are doing heavy work. With all this, it seems to be quite robust.

vchuravy · May 31, 2023, 8:19pm

If you have a reproducer for why you have to do that I would be interested in seeing it.

Generally speaking, when your foreign threads do something blocking like std::cv::wait or heavy C++ operations, you want to transition the threads into a “GC safe region” by using jl_gc_safe_enter and jl_gc_safe_leave. You don’t want to disable GC entirely.
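One way to make the enter/leave pairing hard to get wrong is a small RAII guard around the blocking wait. The sketch below is an assumption-laden illustration: `FakeTls`, `fake_gc_safe_enter` and `fake_gc_safe_leave` are hypothetical stand-ins so that the example compiles without julia.h, and `GcSafeRegion`, `RunGate` and `wait_for_next_run` are made-up names. In a real wrapper the stand-ins would be replaced by jl_gc_safe_enter and jl_gc_safe_leave with the calling thread's Julia thread-local state, whose exact signatures should be checked against the headers of the Julia version in use.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-ins for the Julia runtime, used only so this sketch
// is self-contained. They are NOT the real Julia API.
constexpr std::int8_t kGcUnsafe = 0;
constexpr std::int8_t kGcSafe = 2;

struct FakeTls {
    std::int8_t gc_state = kGcUnsafe;
};

// Mark the thread GC-safe and return the previous state.
std::int8_t fake_gc_safe_enter(FakeTls* tls) {
    const std::int8_t old = tls->gc_state;
    tls->gc_state = kGcSafe;
    return old;
}

// Restore the state that was saved on entry.
void fake_gc_safe_leave(FakeTls* tls, std::int8_t old) {
    tls->gc_state = old;
}

// RAII guard: safe on construction, previous state restored on
// destruction, including on early return or exception.
class GcSafeRegion {
public:
    explicit GcSafeRegion(FakeTls* tls)
        : tls_(tls), old_(fake_gc_safe_enter(tls)) {}
    ~GcSafeRegion() { fake_gc_safe_leave(tls_, old_); }
    GcSafeRegion(const GcSafeRegion&) = delete;
    GcSafeRegion& operator=(const GcSafeRegion&) = delete;

private:
    FakeTls* tls_;
    std::int8_t old_;
};

// Hypothetical gate the worker threads park on between runs.
struct RunGate {
    std::mutex m;
    std::condition_variable cv;
    bool next_run_ready = false;
};

// Block until the next run is requested. The thread is marked GC-safe for
// the whole wait. Returns the state seen while waiting (for the demo only).
std::int8_t wait_for_next_run(FakeTls* tls, RunGate& gate) {
    GcSafeRegion safe(tls);
    std::unique_lock<std::mutex> lock(gate.m);
    gate.cv.wait(lock, [&] { return gate.next_run_ready; });
    return tls->gc_state;
}
```

Inside such a region the thread should do only non-Julia work (waiting, pure C++ computation). It should not touch Julia-managed objects or call back into Julia until the guard has been destroyed.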

peremato · June 1, 2023, 7:08am

Thanks for the advice. Unfortunately it is not easy to produce a simple reproducer. The C++ package Geant4 is large and complex, and its threading model is far from simple. If I manage, I will let you know.
