TL;DR:
If your code segfaults outside of ccalls, but only AFTER a ccall has run and while the GC is running with multiple threads, the C library may have installed its own signal handler, which now competes with Julia's handler.
The solution is something like:
"""
    sigsegv_handler(new=C_NULL, old=zeros(UInt8, 256); signal) -> typeof(old)

`new` is the new signal handler to install for `signal`, and `old` is the buffer that receives a backup of the current one. Passing `C_NULL` for either one skips it, so by default the function only backs up the current handler. Both arguments must be large enough to hold a `struct sigaction`.
"""
function sigsegv_handler(new=C_NULL, old=zeros(UInt8, 256); signal)
    @ccall sigaction(
        signal::Cint,
        new::Ptr{Cvoid},
        old::Ptr{Cvoid})::Cint
    return old
end
handler = sigsegv_handler(; signal=11)       # back up Julia's handler
# @ccall C code ...
sigsegv_handler(handler, C_NULL; signal=11)  # restore Julia's handler
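The backup/restore pair above can also be wrapped so the handler is put back even when the wrapped call throws. This is only a minimal sketch: `with_saved_handler` and `SIGACTION_BUFSIZE` are hypothetical names (not part of NOMAD.jl or Base), the 256-byte buffer is an assumed upper bound on `sizeof(struct sigaction)`, and the whole thing assumes a POSIX system where `sigaction` is available.

```julia
# Hypothetical helper, not part of NOMAD.jl or Base.
# Assumption: 256 bytes is at least sizeof(struct sigaction) on this platform.
const SIGACTION_BUFSIZE = 256

# Run `f()` with the current handler for `signal` backed up, then restore it.
function with_saved_handler(f; signal=11)  # 11 is SIGSEGV on Linux and macOS
    old = zeros(UInt8, SIGACTION_BUFSIZE)
    # new = C_NULL: only copy the current handler into `old`
    ret = @ccall sigaction(signal::Cint, C_NULL::Ptr{Cvoid}, old::Ptr{Cvoid})::Cint
    ret == 0 || error("sigaction failed while backing up the handler for signal $signal")
    try
        return f()
    finally
        # old = C_NULL: reinstall the backup and drop whatever `f` installed
        @ccall sigaction(signal::Cint, old::Ptr{Cvoid}, C_NULL::Ptr{Cvoid})::Cint
    end
end

# Usage sketch: wrap the call into the C library in a do-block.
# with_saved_handler(; signal=11) do
#     @ccall C code ...
# end
```

The `try`/`finally` is the only real difference from the manual version: the restore runs on normal return and on exceptions alike.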
Long version
One of my go-to options for blackbox optimization is NOMAD. The biggest problem with NOMAD's Julia interface is this issue. Long story short, the interface works fine with a single thread, but using threads of any sort after or while using NOMAD makes the program segfault. After some discussion, mostly with @mkitti, @Sukera, and @jpsamaroo, this was the cause and, building on mkitti's idea, the solution.
The first guess was that the callback was being run from another thread. That cannot be the case, as NOMAD_jll is compiled without OpenMP support, and it would not explain why it segfaults AFTER using NOMAD. After some debugging with gdb I came to the conclusion that the signal handlers must have been tampered with. Lo and behold, that was indeed the case: NOMAD installs its own signal handlers but never restores the previous ones. This matters because Julia's runtime relies on its own SIGSEGV handler, for example for GC safepoints when running with multiple threads.
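For anyone who wants to check this from Julia rather than gdb, here is a minimal sketch that compares the installed handler address before and after a call. `current_handler` and `handler_changed` are hypothetical names, not part of NOMAD.jl or Base. The sketch assumes that the handler pointer is the first field of `struct sigaction` (which holds on Linux and macOS as far as I know, but check your platform's `<signal.h>`) and that 256 bytes is enough to hold the struct.

```julia
# Hypothetical diagnostic helpers, not part of NOMAD.jl or Base.
# Assumption: the handler pointer is the first field of `struct sigaction`.
function current_handler(signal=11)  # 11 is SIGSEGV on Linux and macOS
    buf = zeros(UInt8, 256)  # assumed to be at least sizeof(struct sigaction)
    # new = C_NULL: query only, nothing is installed
    ret = @ccall sigaction(signal::Cint, C_NULL::Ptr{Cvoid}, buf::Ptr{Cvoid})::Cint
    ret == 0 || error("sigaction failed for signal $signal")
    # The first pointer-sized chunk of the buffer is the handler address
    return first(reinterpret(UInt, buf))
end

# Returns `true` if calling `f` changed the handler installed for `signal`.
function handler_changed(f; signal=11)
    before = current_handler(signal)
    f()
    return current_handler(signal) != before
end
```

Only the handler address is compared, not the whole buffer, because the remaining bytes (mask, flags, padding) are not guaranteed to read back identically between two calls.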
That's when mkitti had the brilliant idea to simply back it up on Julia's side with sigaction. The solution was indeed to use it to back up the state, and to restore that state before calling the user function and when leaving NOMAD.jl. The only remaining problem right now is getting a robust estimate of the struct size. 256 bytes seems like a safe choice, as `sizeof(struct sigaction)` is 152 bytes on x86_64 Linux with glibc, which can be checked with:
#include <signal.h>
#include <stdio.h>

int main() {
    /* sizeof yields a size_t, so print it with %zu */
    printf("%zu\n", sizeof(struct sigaction));
    return 0;
}
Or, the more questionable version:
function get_sigsegv_handler()
    v = zeros(UInt8, 1024)
    @ccall sigaction(
        11::Cint,
        C_NULL::Ptr{Cvoid},
        v::Ptr{Cvoid}
    )::Cint
    return v
end
findlast(x -> x != 0, get_sigsegv_handler())
This can be used to estimate the size needed for storing the struct. It only gives a lower bound, since trailing zero bytes of the struct are not counted.
Whether `@ccall`/`@cfunction` should automagically restore the handler is another question, and I hope someone with more understanding can chip in.
PS.
I think it's bad design on the part of bbopt/NOMAD not to restore the handlers. But even if they did restore them, it would not have solved half of the problems.
MWE
using Base.Threads, NOMAD

n = 5
@assert nthreads() > 1

function main()
    function f(x)
        @threads for i in 1:nthreads() * 3
            for j in 1:100
                g = rand(j, j)
            end
            GC.gc()
        end
        (true, true, x[1])
    end
    pb = NomadProblem(n,
                      1,
                      ["OBJ"],
                      f;
                      upper_bound=[100.0 for _ in 1:n],
                      lower_bound=[0.0 for _ in 1:n])
    pb.options.max_bb_eval = 1000
    pb.options.max_time = 20
    result = @time NOMAD.solve(pb, rand(n))
end

main()

@threads for i in 1:100
    for j in 1:100
        g = rand(j, j)
    end
    GC.gc()
end