When using the MT mode of Geant4.jl the main thread hangs in an infinite loop in
ijl_gc_collect(). I am using the latest 1.9 release of Julia on macOS.
The way it works is as follows:
- The main thread (Julia REPL) calls C++ code that creates a number of worker threads. These call back into Julia and are all adopted into the Julia thread pool (I explicitly call
jl_adopt_thread(), although that is probably not necessary since the callback is done through
cfunction). So far, so good. Work is performed as expected.
- After the run is completed, the adopted worker threads are put on wait (
std::cv::wait(...)) and control is returned to the main thread (REPL). I do call
jl_yield() at the end of the run and before entering the wait. The threads are waiting for a new run to eventually be started again.
- At this point, unpredictably and typically when generating some output, the REPL hangs in an infinite loop in `ijl_gc_collect()` between addresses +256 and +264. My guess is that it is looping around:
Is there anything I can do to avoid this hang? Many thanks in advance.
@maleadt do you think the problem I am having could be related to, or fixed by, https://github.com/JuliaLang/julia/pull/49934? I do not know the internals of Julia threading and GC well enough to judge.
No, that would manifest as a segfault when adopting. What might be happening here (as per my limited understanding of that part of Julia) is that when a thread starts GC, it waits for all other Julia threads to reach a safepoint. Your newly adopted threads however are not at a safepoint, yet they are blocked in
std::cv::wait, causing other threads to hang when attempting to enter GC. You probably want to enter a GC safe region during that wait (by calling
jl_gc_safe_enter), so that GC can run during it.
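As a rough illustration of the suggestion above, here is a minimal C++ sketch of wrapping the blocking wait in a GC safe region with an RAII guard. It is only a sketch under assumptions: `GcSafeRegion`, `RunGate`, `wait_for_next_run`, `run_demo` and the `fake_gc_safe_*` functions are hypothetical names, not part of Geant4.jl or Julia. The stand-ins exist so the example builds without libjulia; in a real wrapper they would presumably be replaced by `jl_gc_safe_enter` / `jl_gc_safe_leave` from the Julia headers, whose exact form should be checked against the Julia version in use.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-ins so this sketch compiles without libjulia.
// Assumption: in a real wrapper these would be the Julia runtime calls,
// roughly of the form
//   jl_ptls_t ptls = jl_current_task->ptls;
//   int8_t old = jl_gc_safe_enter(ptls);
//   ... blocking call ...
//   jl_gc_safe_leave(ptls, old);
static std::atomic<int> g_enter_calls{0};  // how often a safe region was entered
static std::atomic<int> g_safe_depth{0};   // safe regions currently open

int8_t fake_gc_safe_enter() { ++g_enter_calls; ++g_safe_depth; return 0; }
void fake_gc_safe_leave(int8_t /*old_state*/) { --g_safe_depth; }

// RAII guard: enter the safe region on construction, leave on destruction,
// so every exit path (including exceptions) restores the previous state.
struct GcSafeRegion {
    int8_t old_state;
    GcSafeRegion() : old_state(fake_gc_safe_enter()) {}
    ~GcSafeRegion() { fake_gc_safe_leave(old_state); }
    GcSafeRegion(const GcSafeRegion&) = delete;
    GcSafeRegion& operator=(const GcSafeRegion&) = delete;
};

// Hypothetical gate on which idle workers park between runs.
struct RunGate {
    std::mutex m;
    std::condition_variable cv;
    bool next_run = false;
};

// Called by an adopted worker thread once its share of the run is done.
// No Julia objects or runtime functions may be touched while `safe` is alive.
void wait_for_next_run(RunGate& gate) {
    GcSafeRegion safe;                          // declared first, destroyed last
    std::unique_lock<std::mutex> lock(gate.m);
    gate.cv.wait(lock, [&] { return gate.next_run; });
}   // the mutex is released first, then the safe region is left

// Single-threaded smoke test: the gate is already open, so the wait
// returns immediately and only the enter/leave bookkeeping is exercised.
bool run_demo() {
    RunGate gate;
    gate.next_run = true;
    const int before = g_enter_calls.load();
    wait_for_next_run(gate);
    return (g_enter_calls.load() - before == 1) && (g_safe_depth.load() == 0);
}
```

The guard is declared before the lock on purpose: leaving a safe region may itself have to wait for a GC that is in progress, and in this ordering that happens after the mutex has been released.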
Thanks very much for your hint. I added
jl_gc_safe_enter and it did work (I didn’t get an infinite loop). Then I removed it to cross-check that this was indeed the solution, but now I no longer get infinite loops either. Something must have changed in the way I build the C++ wrapper so that it now works. Very strange.
Is the current status of this “magically solved” (which could just mean repro is hard)?
Yes. Doing more tests I encountered again the problem. It has to do with the unpredictability of when a GC occurs. To minimize the chances of blockage, when a GC happens, all the waiting adopted threads must have called
jl_gc_safe_enter just before entering the wait state. In addition, I am also disabling GC while the adopted threads are doing heavy work. With all this, it seems to be quite robust.
If you have a reproducer for why you have to do that I would be interested in seeing it.
Generally speaking when your foreign threads do something blocking like
std::cv::wait or heavy C++ operations, you want to transition the threads into a “GC safe region”, by using
jl_gc_safe_enter (and leaving it again with jl_gc_safe_leave once the blocking call returns). You don’t want to disable GC entirely.
Thanks for the advice. Unfortunately it is not easy to produce a simple reproducer. The C++ package Geant4 is large and complex, and its threading model is far from simple. If I manage to build one, I will let you know.