In the Geant4.jl project I am wrapping a large C++ library that supports multi-threading for tracking particles through matter. I have found an almost-working solution by disabling the GC, but for very long runs I run out of memory. I am looking for a way to trigger the GC at defined checkpoints.
The way it works is as follows:
- The main Julia thread triggers the creation of a number of C++ threads that, after some initialization, are put on hold waiting for work. In these foreign threads I call jl_adopt_thread(), and before they are suspended I call jl_gc_safe_enter(). If I did not call the latter, I would get a deadlock in the main thread whenever a garbage collection was triggered.
- Then, from the main Julia thread I start a run that schedules work to these foreign threads, which call back into Julia. Unfortunately, I do get some memory allocation of about 16 or 20 bytes per callback, which I will try to get rid of. In addition, when starting a run I need to disable GC with GC.enable(false), because without it there was an infinite loop between the threads having to do with the JIT compilation of the callbacks.
- I re-enable GC when the call returns from C++. This works well for short runs, but not if billions of callbacks are executed, since garbage is never collected.
- I tried to disable GC only for a short time, but it does not work. Instead of
GC.enable(false)
BeamOn(app.runmanager, nevents) # if nevents is large this may take minutes or hours
GC.enable(true)
I changed it to wait for the completion of the run in another Julia thread:
GC.enable(false)
rt = Threads.@spawn BeamOn(app.runmanager, nevents)
GC.enable(true)
wait(rt)
but it does not work: the garbage is never collected. I tried other variations with istaskdone(rt), sleeping, and issuing GC.gc() explicitly, but none of them seem to work either.
Does anybody have an idea how to issue a barrier in the worker (C++) threads after a while during a run, and force a garbage collection?
Why are you disabling the GC at all? You should be able to just GC.@preserve the memory that the C++ side is touching.
Your C++ threads could manually call the safepoint function.
Probably better is to manage the GC state transition per thread more precisely, e.g. transition from unsafe->safe when you are executing just C++ code, and transition back from safe->unsafe when you are calling Julia functions or returning to Julia.
That way the GC can run without waiting for the C++ threads.
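From the Julia side, that per-thread dance might be sketched roughly as follows. This is a hedged sketch, not Geant4.jl code: it assumes the exported jl_gc_safe_enter/jl_gc_safe_leave runtime symbols that come up later in this thread, uses Libc.systemsleep as a stand-in for a blocking C++ call, and assumes Int8 for the runtime's internal gc_state type.

```julia
# Mark this thread GC-safe around a blocking foreign call, so the GC
# does not have to stop the world waiting for it.
old = ccall(:jl_gc_safe_enter, Int8, ())        # unsafe -> safe; returns the previous state
Libc.systemsleep(0.01)                          # stand-in for blocking C++ work that never touches Julia
ccall(:jl_gc_safe_leave, Cvoid, (Int8,), old)   # restore the previous (unsafe) state
```

The key point is that the calls are matched: enter returns the old state, and leave restores it, so nested regions compose.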
It is experimental; if I do not do it, it hangs. If I remove the disabling of GC I get the following trace sample:
Call graph:
901 Thread_3577232 DispatchQueue_1: com.apple.main-thread (serial)
+ 901 start (in libdyld.dylib) + 1 [0x7fff208c0f3d]
+ 901 main (in julia) + 9 [0x10ed88f79]
+ 901 jl_repl_entrypoint (in libjulia-internal.1.9.dylib) + 168 [0x10eeb4298]
+ 901 true_main (in libjulia-internal.1.9.dylib) + 179 [0x10eeb4393]
+ 901 jfptr__start_49586.clone_1 (in sys.dylib) + 9 [0x11d0204f9]
+ 901 julia__start_49585 (in sys.dylib) + 510 [0x11d0203fe] client.jl:522
+ 901 julia_exec_options_52064 (in sys.dylib) + 32699 [0x11d8db4fb] client.jl:307
+ 901 julia_include_52726 (in sys.dylib) + 49 [0x11d51eb31] Base.jl:457
+ 901 japi1__include_45072 (in sys.dylib) + 1096 [0x11dc3f598] loading.jl:1963
+ 901 ijl_apply_generic (in libjulia-internal.1.9.dylib) + 1924 [0x10ee58004]
+ 901 japi1_include_string_53997 (in sys.dylib) + 542 [0x11ddc4eee] loading.jl:1903
+ 901 ijl_toplevel_eval_in (in libjulia-internal.1.9.dylib) + 150 [0x10ee89e36]
+ 901 jl_toplevel_eval_flex (in libjulia-internal.1.9.dylib) + 4536 [0x10ee89088]
+ 901 jl_toplevel_eval_flex (in libjulia-internal.1.9.dylib) + 4760 [0x10ee89168]
+ 901 jl_interpret_toplevel_thunk (in libjulia-internal.1.9.dylib) + 261 [0x10ee6ec95]
+ 901 eval_body (in libjulia-internal.1.9.dylib) + 1454 [0x10ee6e61e]
+ 901 do_call (in libjulia-internal.1.9.dylib) + 206 [0x10ee7002e]
+ 901 ??? (in <unknown binary>) [0x11a7dde9a]
+ 901 ??? (in <unknown binary>) [0x11a7dde66]
+ 901 ??? (in <unknown binary>) [0x11a7ddf25]
+ 901 jlcxx::detail::CallFunctor<void, G4RunManager&, int>::apply(void const*, jlcxx::WrappedCppPtr, int) (in libGeant4Wrap.dylib) + 56 [0x14f5a9a98]
+ 901 G4RunManager::BeamOn(int, char const*, int) (in libG4run.dylib) + 107 [0x1443aaa3b]
+ 901 G4MTRunManager::RunTermination() (in libG4run.dylib) + 18 [0x14439cc62]
+ 901 G4MTRunManager::WaitForEndEventLoopWorkers() (in libG4run.dylib) + 40 [0x14439d698]
+ 901 G4MTBarrier::Wait() (in libG4global.dylib) + 110 [0x11c484dfe]
+ 901 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) (in libc++.1.dylib) + 18 [0x7fff2080ed72]
+ 901 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff208a5e49]
+ 901 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff20872cbe]
901 Thread_3577241
+ 901 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff208a1443]
+ 901 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff208a58fc]
+ 901 signal_listener (in libjulia-internal.1.9.dylib) + 667 [0x10eeb60ab]
+ 901 kevent (in libsystem_kernel.dylib) + 10 [0x7fff20874c2a]
901 Thread_3577242
+ 901 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff208a1443]
+ 901 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff208a58fc]
+ 901 mach_segv_listener (in libjulia-internal.1.9.dylib) + 29 [0x10eeb4d7d]
+ 901 mach_msg_server (in libsystem_kernel.dylib) + 305 [0x7fff208762c7]
+ 901 mach_msg (in libsystem_kernel.dylib) + 60 [0x7fff2087060c]
+ 901 mach_msg_trap (in libsystem_kernel.dylib) + 10 [0x7fff2087029a]
901 Thread_3577263
+ 901 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff208a1443]
+ 901 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff208a58fc]
+ 901 blas_thread_server (in libopenblas64_.0.3.21.dylib) + 207 [0x127f3f4ef]
+ 901 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff208a5e49]
+ 901 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff20872cbe]
901 Thread_3577264
+ 901 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff208a1443]
+ 901 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff208a58fc]
+ 901 blas_thread_server (in libopenblas64_.0.3.21.dylib) + 207 [0x127f3f4ef]
+ 901 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff208a5e49]
+ 901 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff20872cbe]
901 Thread_3577265
+ 901 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff208a1443]
+ 901 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff208a58fc]
+ 901 blas_thread_server (in libopenblas64_.0.3.21.dylib) + 207 [0x127f3f4ef]
+ 901 _pthread_cond_wait (in libsystem_pthread.dylib) + 1298 [0x7fff208a5e49]
+ 901 __psynch_cvwait (in libsystem_kernel.dylib) + 10 [0x7fff20872cbe]
901 Thread_3577743
+ 901 start_wqthread (in libsystem_pthread.dylib) + 15 [0x7fff208a142f]
+ 901 _pthread_wqthread (in libsystem_pthread.dylib) + 414 [0x7fff208a24c1]
+ 901 __workq_kernreturn (in libsystem_kernel.dylib) + 10 [0x7fff2087193e]
901 Thread_3578113
+ 901 thread_start (in libsystem_pthread.dylib) + 15 [0x7fff208a1443]
+ 901 _pthread_start (in libsystem_pthread.dylib) + 224 [0x7fff208a58fc]
+ 901 uv__cf_loop_runner (in libjulia-internal.1.9.dylib) + 107 [0x10ef2cbdb]
+ 901 CFRunLoopRun (in CoreFoundation) + 40 [0x7fff20a222b2]
+ 901 CFRunLoopRunSpecific (in CoreFoundation) + 563 [0x7fff2099b9ac]
+ 901 __CFRunLoopRun (in CoreFoundation) + 1328 [0x7fff2099c59f]
+ 901 __CFRunLoopServiceMachPort (in CoreFoundation) + 316 [0x7fff2099debf]
+ 901 mach_msg (in libsystem_kernel.dylib) + 60 [0x7fff2087060c]
+ 901 mach_msg_trap (in libsystem_kernel.dylib) + 10 [0x7fff2087029a]
901 Thread_3578470
+ 901 ??? (in <unknown binary>) [0x11a7cc560]
901 Thread_3578471
+ 901 ??? (in <unknown binary>) [0x11a7cba94]
+ 901 ??? (in <unknown binary>) [0x11a7cb28e]
+ 901 ijl_gc_pool_alloc (in libjulia-internal.1.9.dylib) + 15 [0x10eea427f]
+ 901 jl_gc_pool_alloc_inner (in libjulia-internal.1.9.dylib) + 41 [0x10eea42d9]
+ 901 ijl_gc_collect (in libjulia-internal.1.9.dylib) + 258,262,... [0x10eea7fb2,0x10eea7fb6,...]
901 Thread_3578472
+ 901 ??? (in <unknown binary>) [0x11a7cbef4]
+ 901 ??? (in <unknown binary>) [0x11a7cb2c7]
+ 901 ??? (in <unknown binary>) [0x11a7def87]
+ 901 ??? (in <unknown binary>) [0x11a7a60e8]
+ 901 ijl_cstr_to_string (in libjulia-internal.1.9.dylib) + 29 [0x10ee75e7d]
+ 901 ijl_alloc_string (in libjulia-internal.1.9.dylib) + 153 [0x10ee75e09]
+ 901 jl_gc_pool_alloc_inner (in libjulia-internal.1.9.dylib) + 47 [0x10eea42df]
901 Thread_3578473
901 ??? (in <unknown binary>) [0x3cacdda30e9f82af]
901 ??? (in <unknown binary>) [0x11a7a60e8]
901 ijl_cstr_to_string (in libjulia-internal.1.9.dylib) + 29 [0x10ee75e7d]
901 ijl_alloc_string (in libjulia-internal.1.9.dylib) + 153 [0x10ee75e09]
901 jl_gc_pool_alloc_inner (in libjulia-internal.1.9.dylib) + 47 [0x10eea42df]
The main thread is already waiting for the C++ worker threads to finish, and they have started a garbage collection. I should probably also call jl_gc_safe_enter() before the wait in this case, but I do not have a user hook there. I’ll investigate.
So if your callbacks are using @cfunction, you shouldn’t need to call jl_adopt_thread at all; it’s automatic, and it should do the safe/unsafe dance. (This code hasn’t been extensively tested, but that is the expected behaviour.)
One thing to do is to mark yourself GC-safe after adopting the thread. The callbacks handle the state transition themselves, so as long as you don’t use any runtime code yourself, that should be fine.
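As a minimal illustration of the @cfunction mechanics: this toy example (not Geant4.jl code) just calls the generated pointer from the current thread via ccall; the automatic thread adoption and safe/unsafe transitions apply when a foreign thread invokes such a pointer.

```julia
# A Julia callback exposed as a C function pointer.
my_callback(x::Cint)::Cint = x + one(Cint)

# @cfunction produces a pointer a C/C++ library can store and call back later.
const cb_ptr = @cfunction(my_callback, Cint, (Cint,))

# Invoke it the way C would, through the raw pointer:
result = ccall(cb_ptr, Cint, (Cint,), 41)   # 42
```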
Yes, I see that I could do something like this, but in practice I am a bit confused by the absence of documentation.
Is the safe/unsafe state per thread? Do I need to call ‘enter’ and then ‘leave’, storing the state in between?
I could always start with ‘safe’ when calling C++, and then in each callback into Julia do ‘enter unsafe’ and ‘leave unsafe’ on return.
Can I call these functions from Julia? Should I use ccall? Is there a big performance penalty?
Good to know. But I still have to tell Julia that a thread is ‘safe’ before putting any C++ thread to wait for something. I was not doing it for the main thread that waits for the worker threads to finish.
Yeah, ideally you store the state in between: you could already be GC-unsafe, then enter an inner region that also marks itself GC-unsafe, then exit the inner one — and after that you still want to be GC-unsafe.
So the calls are matched, and you should pass the old state between them.
The state is per thread, and basically means that the GC can run while this thread is in GC-safe, without waiting for the world.
Julia has these functions exposed in Base.GC, but generally you shouldn’t need to call them there.
One exception is around a ccall. I was working on Allow for :foreigncall to transition to GC safe automatically by vchuravy · Pull Request #49933 · JuliaLang/julia · GitHub to make that easier for the user, the idea being that long-running foreign calls could be marked GC-safe automatically.
The performance penalty is small.
Thanks very much.
I do not see the functions in Base.GC — are they hidden?
I see that the only case where I need to call them is when I put a foreign thread to wait. In that case I have to ensure it is in the safe state, otherwise it will deadlock if a GC happens.
Ah, we don’t have them exposed… Hm, it might be fine to call them, but it’s tricky since cconvert might allocate, which is why my PR is necessary.
Thanks very much. I think I have solved my problem. Just entering the safe state in the main thread fixes it, and I do not need to disable GC:
state = ccall(:jl_gc_safe_enter, Int8, ())
BeamOn(app.runmanager, nevents)
ccall(:jl_gc_safe_leave, Cvoid, (Int8,), state)
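For robustness, the same pattern can be wrapped in a small helper. This is a sketch with a hypothetical name, with_gc_safe; it assumes the same jl_gc_safe_enter/jl_gc_safe_leave symbols as above and uses try/finally so the previous state is restored even if the wrapped call throws.

```julia
# Hypothetical helper: run f() with this thread marked GC-safe,
# restoring the previous GC state on exit (even on error).
function with_gc_safe(f)
    state = ccall(:jl_gc_safe_enter, Int8, ())
    try
        return f()
    finally
        ccall(:jl_gc_safe_leave, Cvoid, (Int8,), state)
    end
end

# Usage: with_gc_safe(() -> BeamOn(app.runmanager, nevents))
```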