When using the MT mode of Geant4.jl, the main thread hangs in an infinite loop in ijl_gc_collect(). I am using the latest 1.9 version of Julia on macOS.
The way it works is as follows:
The main thread (Julia REPL) calls C++ code that creates a number of worker threads. These call back into Julia and are all adopted into the Julia pool (I call jl_adopt_thread() explicitly, but that is probably not necessary since the callback is done through cfunction). So far, so good. Work is performed as expected.
After the run is completed, the adopted worker threads are put on wait (std::cv::wait(...)) and control is returned to the main thread (REPL). I do call jl_yield() at the end of the run and before entering the wait. The threads are waiting for a new run to eventually be started again.
At this moment, in an unpredictable manner, typically when generating some output, the REPL hangs in an infinite loop in `ijl_gc_collect()` between addresses +256 and +264. My guess is that it is looping around:
No, that would manifest as a segfault when adopting. What might be happening here (as per my limited understanding of that part of Julia) is that when a thread starts GC, it waits for all other Julia threads to reach a safepoint. Your newly adopted threads, however, are not at a safepoint, yet they are blocked in std::cv::wait, causing other threads to hang when attempting to enter GC. You probably want to enter a GC safe region during that wait (by calling jl_gc_safe_enter), so that GC can run during it.
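A minimal sketch of that suggestion, so it runs without libjulia: `gc_safe_enter_stub` and `gc_safe_leave_stub` are hypothetical stand-ins for jl_gc_safe_enter and jl_gc_safe_leave (check the real signatures in the julia.h of your Julia version), and `RunGate` is an invented helper, not part of Geant4 or Julia. The only point is the ordering: enter the safe region before blocking on the condition variable, and leave it after waking and releasing the lock.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for jl_gc_safe_enter / jl_gc_safe_leave.
// They only keep counters so that the sketch runs without libjulia.
static std::atomic<int> g_safe_depth{0};   // threads currently inside a "safe region"
static std::atomic<int> g_enter_calls{0};  // total number of enter calls

std::int8_t gc_safe_enter_stub() {
    ++g_enter_calls;
    ++g_safe_depth;
    return 0;  // assumption: the real call returns the previous GC state
}

void gc_safe_leave_stub(std::int8_t /*old_state*/) {
    assert(g_safe_depth.load() > 0);  // every leave must pair with an enter
    --g_safe_depth;
}

// Invented helper: parks a worker thread until the next run is requested.
struct RunGate {
    std::mutex m;
    std::condition_variable cv;
    bool go = false;

    void wait_for_next_run() {
        // Enter the safe region before anything that can block,
        // including the mutex acquisition itself.
        std::int8_t old_state = gc_safe_enter_stub();
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return go; });
            go = false;
        }
        // Leave only after the lock is released and before any Julia call.
        gc_safe_leave_stub(old_state);
    }

    void start_next_run() {
        {
            std::lock_guard<std::mutex> lock(m);
            go = true;
        }
        cv.notify_one();
    }
};

// Parks one worker, releases it, and returns the safe-region depth afterwards.
int demo_parked_worker() {
    RunGate gate;
    std::thread worker([&gate] { gate.wait_for_next_run(); });
    gate.start_next_run();
    worker.join();
    return g_safe_depth.load();
}
```

In a real wrapper the two stubs would be replaced by the Julia runtime calls, and the worker must not touch Julia-managed objects between enter and leave.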
Thanks very much for your hint. I added jl_gc_safe_enter and it did work (didn’t get an infinite loop). Then I removed it to cross-check that this was indeed the solution, but now I have no infinite loops anymore. Something must have changed in the way I build the C++ wrapper so that it now works. Very strange.
Yes. Doing more tests, I encountered the problem again. It has to do with the unpredictability of when a GC occurs. To minimize the chances of blockage when a GC happens, all the waiting adopted threads must have called jl_gc_safe_enter just before entering the wait state. In addition, I am also disabling GC when the adopted threads are doing heavy work. With all this, it seems to be quite robust.
If you have a reproducer for why you have to do that I would be interested in seeing it.
Generally speaking when your foreign threads do something blocking like std::cv::wait or heavy C++ operations, you want to transition the threads into a “GC safe region”, by using jl_gc_safe_enter and jl_gc_safe_leave. You don’t want to disable GC entirely.
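One way to keep the enter/leave calls paired is an RAII guard around the blocking or heavy C++ section. This is only a sketch: `safe_enter_placeholder` and `safe_leave_placeholder` are hypothetical stand-ins for jl_gc_safe_enter and jl_gc_safe_leave (their exact signatures depend on the Julia version, so check julia.h), and `GcSafeRegion`, `in_gc_safe_region` and `heavy_sum` are names invented for the illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the Julia runtime calls; they only count.
static int g_region_depth = 0;
static int g_regions_entered = 0;

std::int8_t safe_enter_placeholder() {
    ++g_region_depth;
    ++g_regions_entered;
    return 0;  // assumption: the real call returns the previous GC state
}

void safe_leave_placeholder(std::int8_t /*old_state*/) {
    assert(g_region_depth > 0);  // leave must follow a matching enter
    --g_region_depth;
}

// RAII guard: the region is left even if the wrapped code throws.
class GcSafeRegion {
public:
    GcSafeRegion() : old_state_(safe_enter_placeholder()) {}
    ~GcSafeRegion() { safe_leave_placeholder(old_state_); }
    GcSafeRegion(const GcSafeRegion&) = delete;
    GcSafeRegion& operator=(const GcSafeRegion&) = delete;

private:
    std::int8_t old_state_;
};

// Runs a pure C++ callable inside a safe region and returns its result.
// The callable must not call into Julia or touch Julia-owned objects.
template <typename F>
auto in_gc_safe_region(F&& work) -> decltype(work()) {
    GcSafeRegion guard;
    return std::forward<F>(work)();
}

// Example of "heavy" C++ work: sums 1..n without touching Julia.
long heavy_sum(int n) {
    assert(n >= 0);
    return in_gc_safe_region([n] {
        std::vector<long> v(static_cast<std::size_t>(n));
        std::iota(v.begin(), v.end(), 1L);
        return std::accumulate(v.begin(), v.end(), 0L);
    });
}
```

With the guard, GC stays enabled for the rest of the process while the foreign thread is busy; only sections that never call back into Julia should be wrapped this way.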
Thanks for the advice. Unfortunately, it is not easy to produce a simple reproducer. The C++ package Geant4 is large and complex, and its threading model is far from simple. If I manage, I will let you know.