Hanging in ijl_gc_collect() when using adopted threads

When using the MT mode of Geant4.jl, the main thread hangs in an infinite loop in ijl_gc_collect(). I am using the latest 1.9 version of Julia on macOS.
The way it works is as follows:

  • The main thread (Julia REPL) calls C++ code that creates a number of worker threads. These call back into Julia and they are all adopted into the Julia pool (I explicitly call jl_adopt_thread(), but that is probably not necessary since the callback is done through cfunction). So far, so good. Work is performed as expected.
  • After the run is completed, the adopted worker threads are put on wait (std::cv::wait(...)) and control is returned to the main thread (REPL). I do call jl_yield() at the end of the run and before entering the wait. The threads are waiting for a new run to eventually be started again.
  • At this moment, in an unpredictable manner, typically when generating some output, the REPL hangs in an infinite loop in ijl_gc_collect() between addresses +256 and +264. My guess is that it is looping around:
jl_gc_wait_for_the_world(gc_all_tls_states, gc_n_threads);
JL_PROBE_GC_STOP_THE_WORLD();

Is there anything I can do to avoid this hang? Many thanks in advance.

@maleadt do you think this problem I am having could be related/fixed by https://github.com/JuliaLang/julia/pull/49934? I am not really fit to understand the internals of Julia threading and GC.

No, that would manifest as a segfault when adopting. What might be happening here (as per my limited understanding of that part of Julia) is that when a thread starts GC, it waits for all other Julia threads to reach a safepoint. Your newly adopted threads however are not at a safepoint, yet they are blocked in std::cv::wait, causing other threads to hang when attempting to enter GC. You probably want to enter a GC safe region during that wait (by calling jl_gc_safe_enter), so that GC can run during it.

Thanks very much for your hint. I added jl_gc_safe_enter and it did work (I didn’t get an infinite loop). Then I removed it to cross-check that this was indeed the solution, but now I have no infinite loops anymore. Something must have changed in the way I build the C++ wrapper so that it now works. Very strange.

Is the current status of this “magically solved” (which could just mean repro is hard)?

Yes. After more tests I encountered the problem again. It has to do with the unpredictability of when a GC occurs. To minimize the chances of blockage when a GC happens, all the waiting adopted threads must have called jl_gc_safe_enter just before entering the wait state. In addition, I am also disabling GC while the adopted threads are doing heavy work. With all this, it seems to be quite robust.

If you have a reproducer for why you have to do that I would be interested in seeing it.

Generally speaking when your foreign threads do something blocking like std::cv::wait or heavy C++ operations, you want to transition the threads into a “GC safe region”, by using jl_gc_safe_enter and jl_gc_safe_leave. You don’t want to disable GC entirely.
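A minimal sketch of that pattern, assuming a single worker that blocks on a condition variable between runs. To keep it compilable without julia.h, stub_gc_safe_enter and stub_gc_safe_leave are hypothetical stand-ins for jl_gc_safe_enter and jl_gc_safe_leave; the state values, GcSafeRegion, RunGate, wait_for_next_run and start_next_run are likewise illustrative names, not part of Julia or Geant4. Check the real signatures against julia.h for your Julia version before substituting them.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// HYPOTHETICAL stand-ins for the Julia runtime calls, so the sketch builds
// without julia.h. In real code, call jl_gc_safe_enter / jl_gc_safe_leave
// on the adopted thread instead. The numeric values are illustrative only.
enum GcState : std::int8_t { GC_UNSAFE = 0, GC_SAFE = 2 };
thread_local std::int8_t tls_gc_state = GC_UNSAFE;

// Mark the calling thread as GC-safe and return the previous state.
std::int8_t stub_gc_safe_enter() {
    std::int8_t old_state = tls_gc_state;
    tls_gc_state = GC_SAFE;
    return old_state;
}

// Restore the state saved by the matching enter call.
void stub_gc_safe_leave(std::int8_t old_state) {
    assert(tls_gc_state == GC_SAFE);  // enter/leave must be paired
    tls_gc_state = old_state;
}

// RAII guard: the thread is in a GC-safe region for the guard's lifetime.
class GcSafeRegion {
public:
    GcSafeRegion() : saved_(stub_gc_safe_enter()) {}
    ~GcSafeRegion() { stub_gc_safe_leave(saved_); }
    GcSafeRegion(const GcSafeRegion&) = delete;
    GcSafeRegion& operator=(const GcSafeRegion&) = delete;
private:
    std::int8_t saved_;
};

// Shared state used to park a worker between runs.
struct RunGate {
    std::mutex m;
    std::condition_variable cv;
    bool next_run_ready = false;
};

// Block until the next run is signalled. Returns the GC state observed
// while parked, only so the behaviour can be checked in a test.
std::int8_t wait_for_next_run(RunGate& gate) {
    GcSafeRegion safe;  // entered before blocking
    std::unique_lock<std::mutex> lock(gate.m);
    gate.cv.wait(lock, [&] { return gate.next_run_ready; });
    // No Julia calls and no access to Julia objects inside this region.
    return tls_gc_state;
    // 'lock' is released first, then 'safe' leaves the region.
}

// Called by the main thread to wake the parked worker.
void start_next_run(RunGate& gate) {
    {
        std::lock_guard<std::mutex> lock(gate.m);
        gate.next_run_ready = true;
    }
    gate.cv.notify_all();
}
```

The guard is declared before the lock so that the mutex is released before the thread leaves the safe region; leaving the region is the point where a real adopted thread may have to wait for a running collection, and it is better not to hold the mutex while doing so. The same guard could wrap long stretches of pure C++ work, as long as nothing inside touches Julia.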

Thanks for the advice. Unfortunately, it is not easy to produce a simple reproducer. The C++ package Geant4 is large and complex, and its threading model is far from simple. If I manage, I will let you know.