Segfault and crash when embedding Julia and running multithreaded code (GC-related)

Hello! I’m very new to Julia, but I immediately tried to embed it into a node-based CG application (SideFX Houdini) for geometry processing.
Everything worked perfectly fine until I ran a multithreaded loop that did a bunch of allocations. In that case there is an observed ~90% chance of a crash, mostly without a stack trace.

  • I am sure all Julia-related functions are always called from the same thread (though the host application itself is multithreaded)
  • I tried rooting literally every variable that casts to jl_value_t*; it had no effect on the crashes
  • I tried disabling the GC by calling

jl_gc_enable(0);
jl_call(…);
jl_gc_enable(1);
and that completely eliminates the crashes, but I cannot afford to run code without garbage collection.

  • In the end I reduced it to a single call that crashes 90% of the time when more than one thread is available:

jl_eval_string(code);

where code is this below:

x=[1]
@time Threads.@threads for i in 1:100
global x=hcat(x,size(rand(10000,1000)));
end

If I either don’t use the Threads module, or set JULIA_NUM_THREADS=1, nothing ever crashes.

Attaching some randomly generated traces (mostly it crashes with no messages, nothing at all).
some stacktraces

Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 9 3950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver2)
Environment:
JULIA_NUM_THREADS = 32

Also tried Julia 1.6.5: same result. Tried different versions of Houdini: same problem.
And obviously there are no problems running any of this threaded code in the REPL.

Forgot to mention: if the code does not crash on the first run, it will run fine afterwards. Some code (without many allocations) I ran in a loop for 20 minutes and it never crashed. Another, more complicated piece (using the LinearAlgebra module) had a small chance to crash when the incoming data size changed. That makes me think the compiler together with the GC is involved here, or simply that many more GC calls happen when the compiler is involved.

My skills are not enough to debug this, and I have no idea at this point where to look.
Thank you!

It seems Houdini’s own signal/segfault handler is active. Maybe if you disable it (as per Automatic Crash Recovery in their docs) you might get more detailed stack trace information by inspecting the core dump file? Just an idea.

Tried that just now. It still mostly crashes without any messages at all; the observed probability of getting a stack trace on a crash is <5%, and I don’t know what cosmic powers affect that.
For some reason I cannot make it generate a core dump file either, though I did set ulimit -c unlimited in bash and managed to get core dumps from some test programs. But again, my expertise in debugging goes as far as printf :slight_smile:

Something to check is if jl_eval_string() is called from the same thread as where jl_init() is called during plugin initialization. You can (probably) call printf("%d\n", pthread_self()) and check if the IDs are the same.

I’m mostly a Blender user and briefly tried to see if installing Houdini would be easy, but it’s not (as I’m on Arch Linux, which it doesn’t support).

Yes, it is the same thread; I am confident, and I’ve checked it once more just now with what you suggested.
To be extra sure I crammed both init and evaluation into one function; the outcome is the same.
I’ve also tried (out of desperation) using a global thread lock I found in the HDK, just to make sure no other Houdini threads are doing anything at that moment (though I cannot be sure what exactly that mutex locks, as it is very poorly documented). That had no effect on the problem either.

Also attaching the list of gcc flags used, just in case.
Most of them come from the HDK’s requirements (including a bunch of defines),
and only the last 2 are directly Julia-related.

g++ flags

I’ve been trying to get a simple multi-threaded example to crash on my end here, but no luck so far. For example:

// gcc -pthread -o t_threading `julia /usr/share/julia/julia-config.jl --cflags --ldflags --ldlibs` t_threading.c 
#include <julia/julia.h>
JULIA_DEFINE_FAST_TLS

void *func(void *arg)
{
    (void)arg;  /* unused */
    jl_init();
    jl_eval_string("println(Threads.nthreads())");
    jl_eval_string("x=[1]; @time Threads.@threads for i in 1:100 global x=hcat(x,size(rand(10000,1000))); end");
    jl_atexit_hook(0);

    return NULL;
}

int main()
{
    pthread_t t;

    pthread_create(&t, NULL, func, NULL);
    pthread_join(t, NULL);
    return 0;
}

Note that there are some implied restrictions when embedding in a multi-threaded environment. The docs don’t mention these, but threads like this one suggest that calling Julia API functions from a thread that was not started by Julia is unsupported, and doing so could lead to all kinds of nasty behaviour. So to put it simply, you should only call jl_... functions from either the thread in which jl_init() was called, or from a thread that was started by Julia (and not from some other thread you started). Can you show some of your code to see what it does?

And does this segfault occur already the first time you run your plugin? Or does it take multiple runs?

I am very sure all Julia calls are happening from the same thread.

Here’s the whole code (I’ll put it in a proper repo a bit later).
There’s a lot of boilerplate, so the important points are:

  • 25: newSopOperator is the entry point for the plugin
  • 62: the constructor of the node object; here jl_init() is called when the first node is created (I moved it around everywhere possible for tests, with no effect on the problem)
  • 98: cookMySop; this method is called every time the node needs to process geometry. There’s a bunch of Houdini-related things, but the important parts are the following:
  • 214: Julia function code evaluation. Even here (or literally anywhere at all, in the constructor or in cookMySop), executing the threaded loop will crash.
    It doesn’t even matter what happens next, since it crashes even here, but just to name it:
  • 246: here a vector of arguments for the next jl_call is declared, and
  • 283: is where jl_call happens.
    But again, it will crash on threaded code even at line 214, or even if I put this jl_eval_string at line 99 (the very beginning of cookMySop), or hell, even in the constructor, right after the jl_init call at line 66

So yeah, even if I call jl_eval_string right after jl_init in the constructor (or even in the plugin hook), it will crash.

About whether this segfault occurs already on the first run: it has a high chance to occur on the first multithreaded function compilation (~90% probability to crash).
If it does NOT crash on the first compilation, it will likely not crash at all; I was running properly working threaded code for >20 minutes.
But every recompilation (if I change the Julia function code) will likely result in a crash.
And also to note: when it does not crash, the multithreaded code works as it’s supposed to and yields the correct result.

P.S. I’ve also managed to run it under ddd, but got the same stack trace as posted above: it all crashes in jl_gc_safepoint_ during GC calls while compiling.

One thing that seems missing from your code is the JL_GC_PUSH/JL_GC_POP calls to make sure values returned from API functions don’t get garbage-collected prematurely by subsequent jl_... calls. This might very well cause the issues you’re seeing (although you could get lucky w.r.t. GC and it might be caused by something else). See this section in the manual.
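For reference, the rooting pattern from that manual section looks roughly like this (a sketch only; it needs julia.h and a linked libjulia to actually build, so it won’t compile standalone):

```c
#include <julia.h>

/* Sketch of the rooting pattern from the embedding manual: values returned
   by jl_... calls must be rooted before any further call that may allocate. */
void rooting_example(void)
{
    jl_value_t *ret = jl_eval_string("sqrt(2.0)");
    JL_GC_PUSH1(&ret);              /* root ret so the GC can't free it */
    jl_eval_string("zeros(1000)");  /* may allocate and trigger a GC pass */
    double val = jl_unbox_float64(ret);  /* ret is still valid here */
    (void)val;
    JL_GC_POP();                    /* matching pop when done with ret */
}
```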

Yes, thank you, but the crashes happen way before all that.
And I think push/pop is not needed in this case anyway: the value returned from jl_call is only checked for errors and discarded, and all my call arguments point to C arrays not managed by Julia. But to be triple sure nothing gets GC’d while I create those array pointers, I used to just disable the garbage collector before the first array creation call and re-enable it just before jl_call (it’s commented out now).
But yes, I did even try to JL_GC_PUSH literally everything that was castable to jl_value_t* :slight_smile: before I understood that the bug happens even if there are just 2 calls in the whole program: jl_init and jl_eval_string.

Ah, I figured the code you posted was already reduced to the bare minimum needed to show the crash. Which might still be a good idea in order to really dig down into the cause :wink:

Started rewriting it from zero (more like copying it in block by block) to make an absolutely minimal example, but stumbled upon an interesting thing.
In the previous gist I had this jl_options.handle_signals = JL_OPTIONS_HANDLE_SIGNALS_OFF; line, which I took from some random Julia crash-related question when I first encountered the crashes. Without that line the crashes still happen, but significantly less often.
I’ve also added back GC rooting for everything.

Here’s the updated code.
Now things still crash, but more rarely; it’s hard to estimate the exact probability:

  • at line 67 0/40 attempts
  • at line 216 (on first ever encountering that line) 1/40
  • at line 302 ~ 1/15

I don’t know if signals have anything to do with this for Julia, or if it’s just a coincidence. Maybe Julia does use signals, and Houdini’s own handlers interfere and cause the crash?

What’s the way to build this code? I managed to install Houdini 19 on Arch, but using hcompile didn’t work on your earlier version of the code (seems to be missing some metadata?).

I’m building it with VS Code.
Here’s the repo: https://github.com/pedohorse/yuria/tree/adding-int-attribs
and tasks.json is the build task for VS Code (you’ll have to change some paths and the Houdini version there).

The problem now is that with these lower crash probabilities it’s more tedious to trigger :slight_smile:

Okay, managed to build the dso. How do I load/trigger it from within Houdini?

Edit: btw, your tasks.json doesn’t link to any Houdini library, is that correct?

I’ve added a Houdini scene here.
When you open it, it should drop you into the network /obj/geo1, where you’ll see a blue+purple circle around the node juliasnippet6. That means the node is the “display” one, and therefore it’s evaluated for display.

That node has the threaded code in its upper code parameter; that code is executed at line 216 with jl_eval_string.
Sometimes it crashes already on load when evaluating it (but the chances are low, as stated above).
Then you can switch to the juliasnippet1 node (marked red); to do so, just toggle the rightmost button on the node itself, which is the display flag.
As you see, that node has threaded Julia fractal code, and it has a ~5% chance to crash on compile; that code is executed with jl_call at line 303.

If it does not crash, you can just keep adding ; symbols to the code of the displayed node to force recompilation; eventually it will crash. Just be sure to “exit” the code editor: Houdini nodes update their parameters when editing finishes.

Hopefully it’s all understandable.

As for linking: it does seem weird, but it seems to be the correct way to do it. All those Houdini-related flags are provided by their hcustom tool.

Ah yes, and don’t forget to set JULIA_NUM_THREADS to something bigger than 1, otherwise it will not crash

(or you can also add jl_options.handle_signals = JL_OPTIONS_HANDLE_SIGNALS_OFF; before init to make crashes much more frequent, hopefully for the same reason)

Well, I’ve been rerunning the scene nodes quite a bit, with varying values for JULIA_NUM_THREADS, tweaking the code, playing the animation, etc. But I’ve only had a single segfault. This is with Houdini 19.0.498.

Looking for other information on the specific calls in the stack trace you’re seeing led me to https://github.com/JuliaParallel/MPI.jl/issues/337, which itself references some scary information in the Julia manual here:

Julia requires a few signals to function properly. The profiler uses SIGUSR2 for sampling and the garbage collector uses SIGSEGV for threads synchronization. If you are debugging some code that uses the profiler or multiple threads, you may want to let the debugger ignore these signals since they can be triggered very often during normal operations.

So the SIGSEGV you’re seeing during GC might very well be normal (and not get sent when running single-threaded). But it might be that Houdini is intercepting the signal and treating it as a major error (which is not unreasonable). Your other signal-related tweak above might also play into this, reducing the number of “crashes”. This is just something to further check out, but perhaps a Julia dev can chime in at this point on this part of the GC behaviour.
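If you do attach gdb (or ddd, which drives gdb), the manual’s advice translates to telling the debugger to pass those signals straight through without stopping:

```
handle SIGSEGV nostop noprint pass
handle SIGUSR2 nostop noprint pass
```

With these set, the debugger will only stop on segfaults that Julia itself doesn’t handle, which makes the real crash site much easier to find.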

I tried

x=[1]
Threads.@threads for i in 1:100
    global x=hcat(x,size(rand(10000,1000)))
end
println(length(x))

and got for example

70

Doesn’t this indicate a race condition?

Ah, true, I did not notice that this is not the best example; I just needed something that allocates/frees a bunch of memory, and I used hcat to a global var to ensure nothing gets optimized out.
But when it does crash, it seems to crash only on the Threads.@threads line, not inside the loop, judging by a bunch of printlns put outside and inside the loop.
And it actually crashes on any threaded loop; the contents of the loop do not matter.

That is a very interesting point!
Thank you for spending your time on all this!
I’m looking for a way to somehow block Houdini from processing SIGSEGV; I’m not sure if it’s possible, and I cannot understand how it “leaks” to Houdini in such a nondeterministic manner.

True, that particular fractal code is not very “crashy”, but code with more allocations has a much higher chance of triggering a crash; I should update the bug scene when I have some time.