Segfaults and crashes when embedding Julia with a multithreaded GC

I’d assume nothing about program behavior after races. Out of curiosity: do you have two different garbage collectors running (Julia and host) in the same process space? If yes, what guarantees do we have that they do not interfere?

I have no idea how the host program’s internals work - it’s closed source, and even its API is not too well documented.
It is unlikely there is another garbage collector in there, but there can be anything else.

The documentation indicates that Python scripting (which uses reference counting IIRC) is supported. But maybe Julia embedding is undefined behavior land then?

Yes, it does use Python, and Python does have a GC, but the host is very strict about where and when Python code is run, and it is guaranteed that no Python runs in parallel with a node cook, which is where I make my Julia calls.
But anyway, Python’s GC is single-threaded, locked by the GIL, like everything there, right?

Well, I still have hope it’s a solvable situation :slight_smile:

Yes, reference counting automatic memory management, but not a mark and sweep garbage collector. These are different.

As far as I know, CPython uses both.
But still - from the observed design of the host software, it is highly unlikely that any Python runs in parallel with node cooking.

To be constructive: you (seemingly successfully) tried disabling the GC around the calls.

What happens if you collect manually via GC.gc() or the corresponding jl-call afterwards?

Yes, I’ve tried that too:

jl_gc_enable(0);
jl_call(…);
jl_gc_enable(1);
jl_gc_collect(JL_GC_FULL);

This seems to work stably, but only in JL_GC_FULL mode; other modes may still cause a crash.

But I cannot afford to work like this - some of the functions being jl_call’ed do a lot of temporary allocations. In one case, for example, I can see process memory consumption going above 30 GB and then immediately dropping to 1 GB as the GC is explicitly called. Running with the GC enabled (when it doesn’t crash) keeps memory usage below 2 GB.

Nice! The next step would be to try incremental garbage collection while running the script, but I doubt that is supported. So I’m sorry to say that you may need to look for a different design…

For the sake of completeness: I don’t know if Multi-Threading · The Julia Language is related to this discussion…

I removed all jl_options, extra signal handling, extra locking etc. and just moved Julia initialization into the cook method, as the first thing it does on first run - and I cannot get a single segfault out of it…

Rhetorical questions incoming:
I just want to know - what is the logic in this?
The constructor and cook were running in the same thread, so what’s different?
Why is it so important that Julia’s initializers run that late?
If, on the contrary, I move it up into the plugin entry point - it starts crashing even more often.

Edit: it did crash eventually… but it seems to happen with much lower probability now

So does this mean that just initializing Julia (i.e. jl_init) from the plugin entry point is enough for a crash? Or does it still need a node execution?

Edit: and did you observe a crash at all when using only a single Julia thread?

Yes - it crashes if you initialize Julia in the entry point but run code in the node cook method, even though both report that they are happening on the same thread.

My gut feeling tells me that Julia initialization somehow needs all of Houdini’s extra threads already started and running; then this SIGSEGV misfire doesn’t happen. So if you run it in node cook, Houdini is already as initialized as it can be. Then, over time, some of the software’s own threads get stopped and started, and those new threads start causing the same issue as before.

About single-threaded Julia: when JULIA_NUM_THREADS is set to 1, I have not observed a single crash ever, and I have run it for quite some time.
I haven’t observed any crashes with JULIA_NUM_THREADS>1 on non-threaded code either, but I haven’t been testing that specifically.

I’m still not sure there really is an underlying issue. I.e. these segfaults may still just be symptoms of Julia’s use of segfaults in GC synchronization. Starting and stopping non-Julia threads should be no issue; otherwise it would be a severe limitation on embedding Julia.

When I run my little test snippet from above under gdb it does not raise SIGSEGV when JULIA_NUM_THREADS=1, but for multiple threads it does (although in a different location than where you see them):

melis@juggle 11:19:~$ JULIA_NUM_THREADS=3 gdb ./t_julia_embed
GNU gdb (GDB) 11.1
Reading symbols from ./t_julia_embed...
(No debugging symbols found in ./t_julia_embed)
(gdb) run
Starting program: /home/melis/t_julia_embed 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7ffff23b9640 (LWP 4269)]
[New Thread 0x7ffff18d0640 (LWP 4270)]
[New Thread 0x7fffdd5c2640 (LWP 4271)]
[New Thread 0x7fffdcdc1640 (LWP 4272)]
[New Thread 0x7fffcffff640 (LWP 4273)]
[New Thread 0x7fffc77fe640 (LWP 4274)]
[New Thread 0x7fffbeffd640 (LWP 4275)]
3

Thread 2 "t_julia_embed" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff23b9640 (LWP 4269)]
0x00007ffff7263cd8 in jl_mutex_wait (lock=<optimized out>, safepoint=1) at /buildworker/worker/package_linux64/build/src/julia_locks.h:34
34	/buildworker/worker/package_linux64/build/src/julia_locks.h: No such file or directory.
(gdb) bt
#0  0x00007ffff7263cd8 in jl_mutex_wait (lock=<optimized out>, safepoint=1) at /buildworker/worker/package_linux64/build/src/julia_locks.h:34
#1  jl_mutex_lock (lock=<optimized out>) at /buildworker/worker/package_linux64/build/src/julia_locks.h:85
#2  jl_generate_fptr (mi=mi@entry=0x7fffe848c060, world=world@entry=31320) at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:316
#3  0x00007ffff71d526d in jl_compile_method_internal (mi=mi@entry=0x7fffe848c060, world=world@entry=31320) at /buildworker/worker/package_linux64/build/src/gf.c:1980
#4  0x00007ffff71d5ec2 in jl_compile_method_internal (world=31320, mi=0x7fffe848c060) at /buildworker/worker/package_linux64/build/src/gf.c:2246
#5  _jl_invoke (world=31320, mfunc=<optimized out>, nargs=0, args=0x7fffe8488048, F=0x7fffe82e79b0) at /buildworker/worker/package_linux64/build/src/gf.c:2239
#6  jl_apply_generic (F=<optimized out>, args=0x7fffe8488048, nargs=<optimized out>) at /buildworker/worker/package_linux64/build/src/gf.c:2429
#7  0x00007ffff71fa24f in jl_apply (nargs=1, args=<optimized out>) at /buildworker/worker/package_linux64/build/src/julia.h:1788
#8  start_task () at /buildworker/worker/package_linux64/build/src/task.c:877

Looking closer at that particular location:

https://github.com/JuliaLang/julia/blob/ac5cc99908d463582e66db3368b9b48fae1e2525/src/julia_locks.h#L30-L35

So line 34 calls jl_gc_safepoint_(), which is this macro:

https://github.com/JuliaLang/julia/blob/ac5cc99908d463582e66db3368b9b48fae1e2525/src/julia_threads.h#L301-L310

Note the comment there on use of segfault.

You could try running Houdini under gdb, trigger the segfault and maybe that will give a clue as to where it really comes from.

Edit 16/3/22: in https://github.com/JuliaLang/julia/issues/41586#issuecomment-881311072 there’s this remark, which indeed confirms the way the GC uses segfault signals:

Our GC safepoints are async based, e.g. they are implemented as a load from a location that will seqfault and the fault handler will then switch to GC

Following several incomplete backtraces on crashes - this is the exact point to which they all lead: they all happen in jl_gc_safepoint_.
I have also run it with DDD (on top of gdb), and it also led to the same jl_gc_safepoint_ for me.

I don’t have much understanding of how threaded signal handling works: how it is that Julia can trigger a segfault only for its own threads by default, and if that segfault signal “leaks” to the host program’s threads - why, and how it was originally designed not to “leak” it.

Well, that points strongly in the same direction then.

I suspect that when the original GC setup was designed this way, embedding wasn’t the primary focus. This use of SIGSEGV will work fine as long as nothing else expects SIGSEGV to be a sign of something having gone wrong (which is pretty much everybody). I think @yuyichao was the original author of the threading-related GC code, so maybe he has some insight into how the SIGSEGV signaling used by Julia might interact with other software when embedding?

I would really love to hear @yuyichao’s comment on this.

For now, what I’ve noticed is that when it crashes with the latest code, the SIGSEGV is always raised on Houdini’s cooking thread, where Julia was initialized. My wild guess is that Houdini manages its threads itself and catches all important signals, without any way for me to override that behaviour.
So maybe it’s worth a shot to start a completely new pthread on plugin initialization and run all Julia code there - that way Houdini should not be able to manage that thread. All Julia commands would then be queued from Houdini’s cooking thread to Julia’s “main” thread.
I haven’t used pthreads directly, but I will give it a try when I get some free time, unless someone says it’s a stupid idea.
Edit: maybe this will help

I suspect Houdini is continuously updating the SIGSEGV handler just before it calls into the plugin code, as otherwise I don’t see why the handler set by jl_init() would not stay around (and Houdini would never know of the SIGSEGVs). I mean, do you ever see a crash on the first call to the SOP, or only on subsequent calls (after Julia has already been initialized, but Houdini might have restored its own signal handlers)?

If so, then this dirty hack could be worth a try. Directly after calling jl_init(), save the current SIGSEGV handler information with sigaction(SIGSEGV, NULL, &julia_sigsegv_action). The julia_sigsegv_action variable is of type struct sigaction and needs to be stored somewhere it can outlive the call to cookMySop. Then, whenever cookMySop is called but Julia is already initialized, restore Julia’s handler with sigaction(SIGSEGV, &julia_sigsegv_action, NULL). I’m curious if that works around the issue of Houdini catching the SIGSEGVs.

Finally, just as another test, if I use the below I can see dozens of SIGSEGV signals passing by :slight_smile:

#include <stdio.h>
#include <signal.h>
#include <pthread.h>
#include <julia.h>

void handler(int signum)
{
    (void)signum;
    fprintf(stderr, "GOT SIGSEGV\n");
}

void *func(void *arg)
{
    (void)arg;
    jl_init();

    // Install own handler, i.e. like Houdini
    signal(SIGSEGV, handler);

    jl_eval_string("println(Threads.nthreads())");
    jl_eval_string("x=[1]; @time Threads.@threads for i in 1:100 global x=hcat(x,size(rand(10000,1000))); end");
    jl_atexit_hook(0);

    return NULL;
}

int main()
{
    pthread_t t;
    pthread_create(&t, NULL, func, NULL);
    pthread_join(t, NULL);
    return 0;
}

I’ve put Julia into a dedicated thread, but that still caused SIGSEGVs.

But then I added your cheat with sigaction: saving the action on Julia init and restoring it each time Julia functions run during cooks.

So far I’ve extensively run all kinds of usual tests and it hasn’t crashed once. I know last time I said the same it crashed shortly after, so I’ll keep testing. But whatever the result - thank you for this valuable advice!

But what I’ve understood is that I completely don’t understand how threaded signal handling works… I thought each thread had its own handlers, so that Julia, being in a separate thread, would be fine living away from Houdini. But judging by this trick, it appears that handlers are not bound to a single thread after all… I need to read about that in my spare time.