Hello! I'm very new to Julia, but I immediately tried to embed it into a node-based CG software (SideFX Houdini) for geometry processing.
Everything worked perfectly fine until I ran a multithreaded loop that did a bunch of allocations; in that case there is an observed ~90% chance of a crash, mostly without a stack trace.
- I am sure all Julia-related functions are always called from the same thread (though the parent software itself is multithreaded)
- I tried rooting literally every variable that casts to jl_value_t* - it had no effect on the crashes
- I tried disabling the GC before evaluating the code,
and that totally solves the crashes, but I cannot afford to run code without garbage collection.
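For reference, the toggle on the embedding side presumably looks something like this (a sketch using the documented jl_gc_enable API; run_snippet_without_gc is an illustrative name, not from the actual plugin):

```c
#include <julia.h>

/* Sketch: run user code with the GC switched off.
   This avoids the crashes, but memory allocated by the snippet
   is never reclaimed while the GC is disabled. */
void run_snippet_without_gc(const char *code) {
    jl_gc_enable(0);      /* returns the previous GC state */
    jl_eval_string(code);
    jl_gc_enable(1);
}
```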
- In the end I could reduce it to a single call that crashes 90% of the time if more than one thread is enabled.
The code is below:
@time Threads.@threads for i in 1:100
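On the C side this boils down to a single jl_eval_string call; a minimal sketch (the exact loop body doesn't matter as long as it allocates, as noted further down, and the julia-config.jl path depends on your install):

```c
// gcc -o repro `julia /usr/share/julia/julia-config.jl --cflags --ldflags --ldlibs` repro.c
// run with JULIA_NUM_THREADS > 1 to reproduce
#include <julia.h>

int main(void) {
    jl_init();
    /* any allocating threaded loop triggers the crash inside the host app */
    jl_eval_string("@time Threads.@threads for i in 1:100 rand(10000,1000); end");
    jl_atexit_hook(0);
    return 0;
}
```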
If I either don't use the Threads module, or set
JULIA_NUM_THREADS=1 - nothing ever crashes.
Attaching some randomly generated traces (mostly it crashes with no messages, nothing at all).
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 9 3950X 16-Core Processor
LLVM: libLLVM-12.0.1 (ORCJIT, znver2)
JULIA_NUM_THREADS = 32
Also tried Julia 1.6.5 - same; tried different versions of Houdini - same problem.
And obviously there are no problems running any of that threaded code in the REPL.
Forgot to mention: IF the code doesn't crash on the first run, it will run fine afterwards. Some code (without many allocations) I was running in a loop for 20 minutes - it never crashed. Another, more complicated piece of code (that uses the LinearAlgebra module) had a small chance to crash when the incoming data size changed. That brings me to think that the compiler together with the GC is involved here, or it's just that there are generally many more GC calls when the compiler is involved.
My skills are not enough to debug this, and I have no idea at this point where to look.
It seems Houdini's own signal/segfault handler is active. If you disable it (as per Automatic Crash Recovery in their docs) you might get more detailed stack trace information by inspecting the core dump file. Just an idea.
Tried that just now. It still mostly crashes without any messages at all; the observed probability of a stack trace on crash is <5%. I don't know what cosmic powers affect that.
For some reason I cannot make it generate a core dump file either, though I did set
ulimit -c unlimited in bash, and managed to get core dumps from some test programs. But again, my expertise in debugging goes about as far as printf.
Something to check is if
jl_eval_string() is called from the same thread as where
jl_init() is called during plugin initialization. You can (probably) call
printf("%lu\n", (unsigned long)pthread_self()) and check if the IDs are the same.
I’m mostly a Blender user and briefly tried to see if installing Houdini would be easy, but it’s not (as I’m on Arch Linux, which it doesn’t support).
Yes, it is the same thread, I am confident, and I've checked it just now once more with what you suggested)
To be mega sure I just crammed both init and evaluation into one function - the outcome is the same.
I've also tried (out of desperation) using a global thread lock I found in the HDK, just to make sure no other Houdini threads are doing anything at that moment (though I cannot be sure what exactly that mutex locks, as it is very poorly documented). And that had no effect on the problem.
Also attaching the list of gcc flags used, just in case.
Most of them come from the HDK's requirements (including a bunch of defines),
and just the last 2 are directly Julia-related.
I’ve been trying to get a simple multi-threaded example to crash on my end here, but no luck so far. For example:
// gcc -pthread -o t_threading `julia /usr/share/julia/julia-config.jl --cflags --ldflags --ldlibs` t_threading.c
jl_eval_string("x=0; @time Threads.@threads for i in 1:100 global x=hcat(x,size(rand(10000,1000))); end");
pthread_create(&t, NULL, func, NULL);
Note that there are some implied restrictions when embedding in a multi-threaded environment. The docs don't mention these, but threads like this one suggest that calling Julia API functions from a thread that was not started by Julia is unsupported, and doing so could lead to all kinds of nasty behaviour. So to put it simply, you should only call
jl_... functions from either the thread in which
jl_init() was called, or from a thread that was started by Julia (and not from some other thread you started). Can you show some of your code to see what it does?
And does this segfault occur already the first time you run your plugin? Or does it take multiple runs?
I am very sure all julia calls are happening from the same thread.
Here's the whole code (I'll put it in a proper repo a bit later).
There's a lot of boilerplate, so the important points are:
- 25: newSopOperator - the entry point for the plugin
- 62: the constructor of the node object - here
jl_init() is called the first time the first node is created (I moved it around everywhere possible for tests - no effect on the problem)
- 98: cookMySop - this method is called every time the node needs to process geometry. There's a bunch of Houdini-related things, but the important parts are the following:
- 214: Julia function code evaluation. Even if it is here (or literally anywhere at all, in the constructor or in cookMySop), if I execute said threaded loop - it will crash.
It doesn't even matter what happens next, since it crashes even here, but just to name it:
- 246: here a vector of arguments for the next jl_call is declared, and
- 283: is where the jl_call happens
But again, it will crash on threaded code even on line 214, or even if I put this
jl_eval_string on line 99, at the very beginning of cookMySop, or hell, even if I put it in the constructor, right after the
jl_init call on line 66.
So yeah, even if I call
jl_eval_string right after
jl_init in the constructor (or even in the plugin hook) - it will crash.
About whether this segfault occurs already the first time: it has a high chance to occur on the first multithreaded function compilation (~90% probability to crash).
If it does NOT crash on the first compilation - it will likely not crash at all; I was running properly working threaded code for >20 minutes.
But every recompilation (if I change the Julia function code) will likely result in a crash.
And also to note: when it does not crash, the multithreaded code works as it is supposed to and yields the correct result.
P.S. I've also managed to run it with DDD, but got the same stack trace as posted above - it all crashes on
jl_gc_safepoint_ during GC calls while compiling.
One thing that seems missing from your code is the
JL_GC_POP calls to make sure values returned from API functions don’t get garbage-collected prematurely by subsequent
jl_... calls. This might very well cause the issues you’re seeing (although you could get lucky w.r.t. GC and it might be caused by something else). See this section in the manual.
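The rooting pattern from that manual section looks roughly like this (a sketch; the boxed values are just placeholders for whatever jl_... results your code holds across further API calls):

```c
#include <julia.h>

/* Sketch: root temporaries so a GC triggered by later allocations
   cannot collect them while they're still in use. */
void rooted_call(void) {
    jl_value_t *a = jl_box_float64(1.0);
    jl_value_t *b = NULL;
    JL_GC_PUSH2(&a, &b);          /* a and b are now GC roots */
    b = jl_box_float64(2.0);      /* safe: a cannot be collected here */
    /* ... e.g. jl_call2(some_func, a, b) ... */
    JL_GC_POP();                  /* must match the PUSH before returning */
}
```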
Yes, thank you, but the crashes happen way before all that.
And I think push/pop is not needed in this case anyway - the value returned from
jl_call is only checked for errors and discarded, and all my call arguments point to C arrays unmanaged by Julia. But to be triple sure nothing gets GC'd while I create those array pointers, I used to just disable the garbage collector before the first array creation call and re-enable it just before
jl_call (it's commented out now).
But yes, I did even try to JL_GC_PUSH literally everything that was castable to jl_value_t* before I understood that the bug happens even if there are just 2 calls in the whole program:
Ah, I figured the code you posted was already reduced to the bare minimum needed to show the crash. Reducing it might still be a good idea in order to really dig down into the cause.
I started just rewriting it from zero (more like copying it in block by block) in order to make an absolutely minimal example, but stumbled upon an interesting thing.
In the previous gist I had this
jl_options.handle_signals = JL_OPTIONS_HANDLE_SIGNALS_OFF; line that I took from some random Julia crash-related question when I first encountered the crashes. Without that line crashes still happen, but significantly less often.
I've also added back GC rooting for everything.
Here's the updated code.
Things still crash now, but more rarely; it's hard to estimate the exact probability:
- at line 67: 0/40 attempts
- at line 216 (on first ever encountering that line): 1/40
- at line 302: ~1/15
I don't know if signals have anything to do with any of this, or is it just a coincidence? Maybe if Julia does use signals, and Houdini has its own handlers that interfere, that causes the crash?
What’s the way to build this code? I managed to install Houdini 19 on Arch, but using
hcompile didn’t work on your earlier version of the code (seems to be missing some metadata?).
I'm building it with VSCode.
Here's the repo: https://github.com/pedohorse/yuria/tree/adding-int-attribs
and tasks.json is the build task for VSCode (you'll have to change some paths and the Houdini version there).
The problem now is that with these lower crash probabilities it's more tedious to trigger.
Okay, managed to build the dso. How do I load/trigger it from within Houdini?
Edit: btw, your tasks.json doesn’t link to any Houdini library, is that correct?
I've added a Houdini scene here.
When you open it, it should drop you into the network
/obj/geo1, where you see a blue+purple circle around the node juliasnippet6. That means that node is the "display" one, therefore it's evaluated for display.
That node has the threaded code in the upper code parameter - that code is executed at line 216 with
jl_eval_string. Sometimes it crashes already on load when evaluating that (but the chances are low, as stated above).
Then you can switch to the
juliasnippet1 node (marked red) - to do so, just toggle the rightmost button on the node itself - that is the display flag.
As you see, that node has the threaded Julia fractal code, and it has a ~5% chance to crash on compile. That code is executed with
jl_call at line 303.
If it does not crash, you can just start adding
; symbols to the code of the displayed nodes to force code recompilation - eventually it will crash. Just be sure to "exit" the code editor - Houdini nodes update their parameters when editing finishes.
hopefully it’s all understandable
As for linking - it does seem weird, but it seems to be the correct way to do it. All those Houdini-related flags are provided by their
Ah yes, and don’t forget to set
JULIA_NUM_THREADS to something bigger than 1, otherwise it will not crash
(or you can also add
jl_options.handle_signals = JL_OPTIONS_HANDLE_SIGNALS_OFF; before init to get crashes much more often, hopefully for the same reason)
Well, I’ve been rerunning the scene nodes quite a bit, with varying values for
JULIA_NUM_THREADS, tweaking the code, playing the animation, etc. But I’ve only had a single segfault. This is with Houdini 19.0.498.
Looking for other information on the specific calls of the stacktrace you’re seeing lead me to https://github.com/JuliaParallel/MPI.jl/issues/337, which itself references some scary information in the Julia manual here:
Julia requires a few signals to function properly. The profiler uses
SIGUSR2 for sampling and the garbage collector uses
SIGSEGV for threads synchronization. If you are debugging some code that uses the profiler or multiple threads, you may want to let the debugger ignore these signals since they can be triggered very often during normal operations.
So the SIGSEGV you're seeing during GC might very well be normal (and not get sent when running single-threaded). But it might be that Houdini is intercepting the signal and treating it as a major error (which is not unreasonable). Your other signal-related tweak above might also play into this, reducing the number of "crashes". This is just something to check out further, but perhaps a Julia dev can chime in at this point on this part of the GC behaviour.
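If you end up poking at it with gdb/ddd again, that same part of the manual suggests telling the debugger to pass those signals straight through to Julia instead of stopping on them; something along these lines (a gdb configuration fragment):

```shell
# inside gdb, before running the program:
handle SIGSEGV noprint nostop pass   # GC thread synchronization
handle SIGUSR2 noprint nostop pass   # profiler sampling
```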
Threads.@threads for i in 1:100
and got for example
Doesn’t this indicate a race condition?
Ah, true, I did not notice that this is not the best example; I just needed something that allocates/frees a bunch of memory, and I used hcat to a global var to ensure nothing gets optimized out.
But if it crashes, it seems to only crash on the
Threads.@threads line, not inside the loop, judging by a bunch of
printlns put outside and inside the loop.
And it actually crashes on any threaded loop - the contents of the loop do not matter.
That is a very interesting point!
Thank you for spending your time on all this!
I'm looking for a way to somehow block Houdini from processing SIGSEGV; not sure if it's possible, and I cannot understand how it "leaks" to Houdini in such a nondeterministic manner.
True, that particular fractal code is not very "crashy", but code with more allocations has a much higher chance to trigger a crash; I should update the bug scene when I have some time.