Segfaults and crashes when embedding Julia with a multithreaded GC

I’d assume nothing about program behavior after races. Out of curiosity: do you have two different garbage collectors running (Julia and host) in the same process space? If yes, what guarantees do we have that they do not interfere?

I have no idea how the host program’s internals work - it’s closed source, and even its API is not too well documented.
It is unlikely there is another garbage collector in there, but there can be anything else.

The documentation indicates that Python scripting (which uses reference counting IIRC) is supported. But maybe Julia embedding is undefined behavior land then?

Yes, it does use Python, and Python does have a GC, but the host is very strict about where and when Python code is run, and it is guaranteed that no Python runs in parallel with a node cook, which is where I make my Julia calls.
But anyway, Python’s GC is single-threaded, locked by the GIL, like everything there, right?

Well, I still have hope it’s a solvable situation :slight_smile:

Yes, reference counting automatic memory management, but not a mark and sweep garbage collector. These are different.

As far as I know, CPython uses both.
But still - from the observed design of the host software, it is highly unlikely that any Python runs in parallel with node cooking.

To be constructive: you (seemingly successfully) tried disabling the GC around the calls.

What happens if you collect manually via GC.gc() or the corresponding jl-call afterwards?

Yes, I’ve tried that too:

jl_gc_enable(0);
jl_call(…);
jl_gc_enable(1);
jl_gc_collect(JL_GC_FULL);

This seems to work stably, but only in JL_GC_FULL mode; other modes may still cause a crash.

But I cannot afford to work like this - some of the functions being jl_call’ed do a lot of temporary allocations. In one case, for example, I can see process memory consumption going above 30 GB and then immediately dropping to 1 GB as the GC is explicitly called. Running with the GC enabled (when it doesn’t crash) keeps memory usage below 2 GB.

Nice! The next step would be to try incremental garbage collection while running the script, but I doubt that is supported. So I’m sorry to say that you may need to look for a different design…

For the sake of completeness: I don’t know if Multi-Threading · The Julia Language is related to this discussion…

I removed all jl_options, extra signal handling, extra locking etc. and just moved Julia initialization into the cook method, as the first thing it does on first run - and I cannot get a single segfault out of it…

Rhetorical questions incoming:
I just want to know - what is the logic in this?
The constructor and cook were running in the same thread, so what’s different?
Why is it so important that Julia’s initializers run that late?
If, on the contrary, I move it up into the plugin entry point - it starts crashing even more often.

Edit: it did crash eventually… but it seems to happen with much lower probability now

So does this mean that just initializing Julia (i.e. jl_init) from the plugin entry point is enough for a crash? Or does it still need a node execution?

Edit: and did you observe a crash at all when using only a single Julia thread?

Yes - it crashes if you initialize Julia in the entry point but run code in the node cook method, even though both report that they are happening on the same thread.

My gut feeling tells me that Julia initialization somehow needs all of Houdini’s extra threads already started and running; then this SIGSEGV misfire doesn’t happen. So if you run it in node cook, Houdini is already as initialized as it can be. Then, over time, some of the software’s own threads get stopped and started, and those new threads start causing the same issue as before.

About single-threaded Julia: when JULIA_NUM_THREADS is set to 1, I have not observed a single crash ever, and I have run it for quite some time.
I haven’t observed any crashes with JULIA_NUM_THREADS>1 on non-threaded code either, but I haven’t been testing that specifically.

I’m still not sure there really is an underlying issue. I.e. these segfaults may still just be symptoms of Julia’s use of segfaults in GC synchronization. Starting and stopping non-Julia threads should be no issue; otherwise it would be a severe limitation on embedding Julia.

When I run my little test snippet from above under gdb it does not raise SIGSEGV when JULIA_NUM_THREADS=1, but for multiple threads it does (although in a different location than where you see them):

melis@juggle 11:19:~$ JULIA_NUM_THREADS=3 gdb ./t_julia_embed
GNU gdb (GDB) 11.1
Reading symbols from ./t_julia_embed...
(No debugging symbols found in ./t_julia_embed)
(gdb) run
Starting program: /home/melis/t_julia_embed 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7ffff23b9640 (LWP 4269)]
[New Thread 0x7ffff18d0640 (LWP 4270)]
[New Thread 0x7fffdd5c2640 (LWP 4271)]
[New Thread 0x7fffdcdc1640 (LWP 4272)]
[New Thread 0x7fffcffff640 (LWP 4273)]
[New Thread 0x7fffc77fe640 (LWP 4274)]
[New Thread 0x7fffbeffd640 (LWP 4275)]
3

Thread 2 "t_julia_embed" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff23b9640 (LWP 4269)]
0x00007ffff7263cd8 in jl_mutex_wait (lock=<optimized out>, safepoint=1) at /buildworker/worker/package_linux64/build/src/julia_locks.h:34
34	/buildworker/worker/package_linux64/build/src/julia_locks.h: No such file or directory.
(gdb) bt
#0  0x00007ffff7263cd8 in jl_mutex_wait (lock=<optimized out>, safepoint=1) at /buildworker/worker/package_linux64/build/src/julia_locks.h:34
#1  jl_mutex_lock (lock=<optimized out>) at /buildworker/worker/package_linux64/build/src/julia_locks.h:85
#2  jl_generate_fptr (mi=mi@entry=0x7fffe848c060, world=world@entry=31320) at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:316
#3  0x00007ffff71d526d in jl_compile_method_internal (mi=mi@entry=0x7fffe848c060, world=world@entry=31320) at /buildworker/worker/package_linux64/build/src/gf.c:1980
#4  0x00007ffff71d5ec2 in jl_compile_method_internal (world=31320, mi=0x7fffe848c060) at /buildworker/worker/package_linux64/build/src/gf.c:2246
#5  _jl_invoke (world=31320, mfunc=<optimized out>, nargs=0, args=0x7fffe8488048, F=0x7fffe82e79b0) at /buildworker/worker/package_linux64/build/src/gf.c:2239
#6  jl_apply_generic (F=<optimized out>, args=0x7fffe8488048, nargs=<optimized out>) at /buildworker/worker/package_linux64/build/src/gf.c:2429
#7  0x00007ffff71fa24f in jl_apply (nargs=1, args=<optimized out>) at /buildworker/worker/package_linux64/build/src/julia.h:1788
#8  start_task () at /buildworker/worker/package_linux64/build/src/task.c:877

Looking closer at that particular location:

https://github.com/JuliaLang/julia/blob/ac5cc99908d463582e66db3368b9b48fae1e2525/src/julia_locks.h#L30-L35

So line 34 calls jl_gc_safepoint_(), which is this macro:

https://github.com/JuliaLang/julia/blob/ac5cc99908d463582e66db3368b9b48fae1e2525/src/julia_threads.h#L301-L310

Note the comment there on use of segfault.

You could try running Houdini under gdb, trigger the segfault and maybe that will give a clue as to where it really comes from.

Edit 16/3/22: in https://github.com/JuliaLang/julia/issues/41586#issuecomment-881311072 there’s this remark, which indeed confirms the way the GC uses segfault signals:

Our GC safepoints are async based, e.g. they are implemented as a load from a location that will seqfault and the fault handler will then switch to GC

Following several incomplete backtraces on crashes - this is the exact point to which they all lead: they all happen in jl_gc_safepoint_.
I have also run it with DDD (on top of gdb), and it also led to the same jl_gc_safepoint_ for me.

I don’t have much understanding of how threaded signal handling works: how it is that Julia can trigger a segfault only for its own threads by default, and if that segfault signal “leaks” to the host program’s threads - why, and how it was originally designed not to “leak” it.

Well, that points strongly in the same direction then.

I suspect that when the original GC setup was designed this way, embedding wasn’t the primary focus. This use of SIGSEGV will work fine as long as nothing else expects SIGSEGV to be a sign of something having gone wrong (which is pretty much everybody). I think @yuyichao was the original author of the threading-related GC code, so maybe he has some insight into how the SIGSEGV signaling used by Julia might interact with other software when embedding?

I would really love to hear @yuyichao’s comment on this.

For now, what I’ve noticed is that when it crashes with the latest code, the SIGSEGV is always raised on Houdini’s cooking thread, where Julia was initialized. My wild guess is that Houdini manages its threads itself and catches all important signals, without any way for me to override that behaviour.
So maybe it’s worth a shot to start a completely new pthread on plugin initialization and run all Julia code there - that way Houdini should not be able to manage that thread. All Julia commands would then be queued from Houdini’s cooking thread to Julia’s “main” thread.
I haven’t used pthreads directly, but I will give it a try when I get some free time, unless someone says it’s a stupid idea.
Edit: maybe this will help

I suspect Houdini is continuously updating the SIGSEGV handler just before it calls into the plugin code, as otherwise I don’t see why the handler set by jl_init() would not stay around (and Houdini would never know of the SIGSEGVs). I mean, do you ever see a crash on the first call to the SOP, or only on subsequent calls (after Julia has already been initialized, but Houdini might have restored its own signal handlers)?

If so, then this dirty hack could be worth a try. Directly after calling jl_init(), save the current SIGSEGV handler information with sigaction(SIGSEGV, NULL, &julia_sigsegv_action). The julia_sigsegv_action variable is of type struct sigaction and needs to be stored somewhere it can outlive the call to cookMySop. Then, whenever cookMySop is called but Julia is already initialized, restore Julia’s handler with sigaction(SIGSEGV, &julia_sigsegv_action, NULL). I’m curious if that works around the issue of Houdini catching the SIGSEGVs.

Finally, just as another test, if I use the below I can see dozens of SIGSEGV signals passing by :slight_smile:

#include <stdio.h>
#include <signal.h>
#include <pthread.h>
#include <julia.h>

void handler(int signum)
{
    (void)signum;
    fprintf(stderr, "GOT SIGSEGV\n");
}

void *func(void *arg)
{
    (void)arg;
    jl_init();

    // Install own handler, i.e. like Houdini
    signal(SIGSEGV, handler);

    jl_eval_string("println(Threads.nthreads())");
    jl_eval_string("x=[1]; @time Threads.@threads for i in 1:100 global x=hcat(x,size(rand(10000,1000))); end");
    jl_atexit_hook(0);

    return NULL;
}

int main()
{
    pthread_t t;
    pthread_create(&t, NULL, func, NULL);
    pthread_join(t, NULL);
    return 0;
}

I’ve put Julia into a dedicated thread, but that still caused SIGSEGVs.

But then I added your cheat with sigaction: saving the action on Julia init and restoring it each time Julia functions run during cooks.

So far I’ve extensively run all kinds of usual tests and it hasn’t crashed once. I know last time I said the same it crashed shortly after, so I’ll keep testing. But whatever the result - thank you for this valuable advice!

But what I’ve understood is that I completely don’t understand how threaded signal handling works… I thought each thread had its own handlers, so that Julia, being in a separate thread, would be fine living away from Houdini. But judging by this trick, it appears that handlers are not bound to a single thread after all… I need to read about that in my spare time.