I have an interesting (in the Chinese proverb sense) problem here.
While working on SPICE.jl I have been having problems with Julia crashing in the test suite but I had not been able to reproduce it consistently until now.
I have narrowed it down to two parameters:
The first one is, confusingly, the size of the code base. When there is a certain amount of code, Julia will crash. What kind of code it is does not seem to matter. More confusingly, the problem can also go away when adding more code, until it reappears later when the size of the code is “right” again.
The inlining pass and/or precompilation seem to be to blame. I originally could not reproduce the crashes because I did not realize that they only occur with the exact flags that Pkg.test() uses to spawn Julia, i.e. the crashes only occur with --inline=yes and --compiled-modules=yes.
I got a new backtrace for y’all. This one is from master; the previous one was from 1.1.
signal (11): Segmentation fault: 11
in expression starting at /Users/helge/projects/julia/SPICE/test/d.jl:1
jl_gc_pool_alloc at /Users/helge/projects/julia/dev/src/gc.c:1112
jl_gc_alloc_ at /Users/helge/projects/julia/dev/src/./julia_internal.h:263
jl_gc_alloc at /Users/helge/projects/julia/dev/src/gc.c:2911
_new_array_ at /Users/helge/projects/julia/dev/src/array.c:100
_new_array at /Users/helge/projects/julia/dev/src/array.c:160
jl_alloc_array_1d at /Users/helge/projects/julia/dev/src/array.c:420
materialize at ./boot.jl:401 [inlined]
broadcast at ./broadcast.jl:751
- at ./arraymath.jl:39 [inlined]
#isapprox#22 at /Users/helge/projects/julia/dev/usr/share/julia/stdlib/v1.2/LinearAlgebra/src/generic.jl:1390
isapprox at /Users/helge/projects/julia/dev/usr/share/julia/stdlib/v1.2/LinearAlgebra/src/generic.jl:1390
unknown function (ip: 0x11b8da514)
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
eval_test at /Users/helge/projects/julia/dev/usr/share/julia/stdlib/v1.2/Test/src/Test.jl:240
unknown function (ip: 0x11b8bb483)
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
do_call at /Users/helge/projects/julia/dev/src/interpreter.c:323
eval_value at /Users/helge/projects/julia/dev/src/interpreter.c:411
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:635
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
jl_interpret_toplevel_thunk_callback at /Users/helge/projects/julia/dev/src/interpreter.c:884
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x11025988f)
unknown function (ip: 0x314)
jl_interpret_toplevel_thunk at /Users/helge/projects/julia/dev/src/interpreter.c:893
jl_toplevel_eval_flex at /Users/helge/projects/julia/dev/src/toplevel.c:797
jl_parse_eval_all at /Users/helge/projects/julia/dev/src/ast.c:873
jl_load at /Users/helge/projects/julia/dev/src/toplevel.c:859
jl_load_ at /Users/helge/projects/julia/dev/src/toplevel.c:866
include at ./boot.jl:325 [inlined]
include_relative at ./loading.jl:1041
include at ./Base.jl:29
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
include at ./client.jl:443
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
do_call at /Users/helge/projects/julia/dev/src/interpreter.c:323
eval_value at /Users/helge/projects/julia/dev/src/interpreter.c:411
eval_stmt_value at /Users/helge/projects/julia/dev/src/interpreter.c:362
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:754
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
jl_interpret_toplevel_thunk_callback at /Users/helge/projects/julia/dev/src/interpreter.c:884
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x1102de28f)
unknown function (ip: 0x2f6)
jl_interpret_toplevel_thunk at /Users/helge/projects/julia/dev/src/interpreter.c:893
jl_toplevel_eval_flex at /Users/helge/projects/julia/dev/src/toplevel.c:797
jl_parse_eval_all at /Users/helge/projects/julia/dev/src/ast.c:873
jl_load at /Users/helge/projects/julia/dev/src/toplevel.c:859
jl_load_ at /Users/helge/projects/julia/dev/src/toplevel.c:866
include at ./boot.jl:325 [inlined]
include_relative at ./loading.jl:1041
include at ./Base.jl:29
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
include at ./client.jl:443
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_fptr_trampoline at /Users/helge/projects/julia/dev/src/gf.c:1895
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
do_call at /Users/helge/projects/julia/dev/src/interpreter.c:323
eval_value at /Users/helge/projects/julia/dev/src/interpreter.c:411
eval_stmt_value at /Users/helge/projects/julia/dev/src/interpreter.c:362
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:754
jl_interpret_toplevel_thunk_callback at /Users/helge/projects/julia/dev/src/interpreter.c:884
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x10d90058f)
unknown function (ip: 0xffffffffffffffff)
jl_interpret_toplevel_thunk at /Users/helge/projects/julia/dev/src/interpreter.c:893
jl_toplevel_eval_flex at /Users/helge/projects/julia/dev/src/toplevel.c:797
jl_parse_eval_all at /Users/helge/projects/julia/dev/src/ast.c:873
jl_load at /Users/helge/projects/julia/dev/src/toplevel.c:859
jl_load_ at /Users/helge/projects/julia/dev/src/toplevel.c:866
include at ./boot.jl:325 [inlined]
include_relative at ./loading.jl:1041
include at ./Base.jl:29
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
exec_options at ./client.jl:307
_start at ./client.jl:476
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
jl_apply at /Users/helge/projects/julia/dev/usr/bin/julia-debug (unknown line)
true_main at /Users/helge/projects/julia/dev/usr/bin/julia-debug (unknown line)
main at /Users/helge/projects/julia/dev/usr/bin/julia-debug (unknown line)
Allocations: 17378241 (Pool: 17375138; Big: 3103); GC: 38
Unfortunately, these backtraces are not very useful. In most cases, the reason for an intermittent crash in the GC is that you are not using the C interface/unsafe API correctly. You see crashes in the GC only because that’s where we detect the corruption, and it starts from the compiler because that’s likely where you see the most small allocations once you’ve optimized your own code. Changing the usage pattern can easily make problems like this go away.
Well, obviously only the ones you use and it in principle shouldn’t be much more work than writing all of them… (not saying that’s trivial work but not more than what you’ve done…).
There’s also rr if you are on Linux, so you only have to reproduce the crash once. Using it to debug this kind of issue does require some knowledge of Julia internals, which is a bit too much to cover in full here…
After the second line, there is nothing keeping tmp_i alive, so the GC might free it. The only things guaranteed to keep variables alive are, AFAIU, GC.@preserve or cconvert inside ccall.
Julia got a lot cleverer about GC in 1.0 which has the positive side effect of making the language much more efficient, but also has the side effect of making a lot of slightly sketchy C interop code like this much more likely to actually cause a crash than before.
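To make the rule above concrete, here is a minimal sketch that only assumes Base. The helper name first_cstring is made up for illustration; it reads a NUL-terminated string through a raw pointer and keeps the buffer rooted with GC.@preserve for as long as that pointer is in use.

```julia
# Hypothetical helper, for illustration only: read a NUL-terminated string
# out of a byte buffer through a raw pointer.
function first_cstring(buf::Vector{UInt8})
    # A raw Ptr does not keep `buf` alive as far as the GC is concerned.
    # GC.@preserve roots `buf` for the duration of the block.
    GC.@preserve buf begin
        # unsafe_string copies the bytes up to the first NUL into a new
        # String, so the result no longer depends on `buf`.
        unsafe_string(pointer(buf))
    end
end

# Bytes for "abc", a NUL terminator, then one byte that must be ignored
buf = UInt8[0x61, 0x62, 0x63, 0x00, 0x78]
result = first_cstring(buf)
```

The important part is that the object named in GC.@preserve is the same object the pointer was taken from. Taking the pointer of a temporary copy (such as a slice) while preserving only the parent array would not protect the copy.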
So the proper way to write the function linked above would be this?
function lparsm(list, delims, nmax, lenout)
    n = Ref{SpiceInt}()
    items = Array{UInt8}(undef, lenout, nmax)
    GC.@preserve items begin
        ccall((:lparsm_c, libcspice), Cvoid, (Cstring, Cstring, SpiceInt, SpiceInt, Ref{SpiceInt}, Ptr{UInt8}),
              list, delims, nmax, lenout, n, items)
        handleerror()
        out = String[unsafe_string(pointer(items[:,i])) for i in 1:n[]]
    end
    out
end
EDIT: Actually the correct way would be split(list). This specific function does not need to be wrapped at all.
That is still wrong because items[:,i] creates a new array that you take the pointer to, but this new array is not protected from GC. You need something like
out = String[]
for i in 1:n[]
    items_i = items[:,i]
    GC.@preserve items_i begin
        push!(out, unsafe_string(pointer(items_i)))
    end
end
I guess you are fine only protecting items as long as you use a view? Also, there is no need for the @preserve to wrap the ccall as well; the conversion from items to Ptr{UInt8} is fine because as I said
function chararray_to_strings(array, nmax=-1)
    strings = String[]
    m, n = size(array)
    n = nmax == -1 ? n : nmax
    n == 0 && return strings
    # Keep `array` rooted while raw pointers into it are in use
    GC.@preserve array begin
        for i = 1:n
            # Pointer to the first byte of column i (column-major layout)
            ptr = pointer(array, (i - 1) * m + 1)
            push!(strings, unsafe_string(ptr))
        end
    end
    strings
end
function chararray_to_strings(array, nmax=-1)
    strings = String[]
    m, n = size(array)
    n = nmax == -1 ? n : nmax
    n == 0 && return strings
    for i = 1:n
        line = array[:, i]
        # Cut at the first NUL byte; take the whole column if there is none
        idx = something(findfirst(iszero, line), m + 1) - 1
        push!(strings, String(line[1:idx]))
    end
    strings
end
I think I found the culprit: one of the C functions I called expected an array of a certain length, but the magic number was hidden deep in the documentation.
I will leave some hints here on how to debug this kind of problem and rename the thread to make it easier to find.
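For this kind of bug, a length check in the wrapper turns silent memory corruption into an ordinary Julia error. The following is only a sketch under assumptions: MIN_LEN and zero_state! are hypothetical names, and memset from the C standard library stands in for a C routine that writes a fixed number of elements without checking the size of its output buffer.

```julia
# Hypothetical minimum length. The real value has to come from the
# documentation of the C function being wrapped.
const MIN_LEN = 6

# Hypothetical wrapper: refuse to call into C with a buffer that is too short.
function zero_state!(buf::Vector{Float64})
    length(buf) >= MIN_LEN ||
        throw(ArgumentError("buffer needs at least $MIN_LEN elements, got $(length(buf))"))
    # Passing `buf` directly as a ccall argument keeps it alive for the call.
    # The C side writes exactly MIN_LEN doubles, no matter how long `buf` is.
    ccall(:memset, Ptr{Cvoid}, (Ptr{Cvoid}, Cint, Csize_t),
          buf, 0, MIN_LEN * sizeof(Float64))
    return buf
end

state = zero_state!(ones(8))
```

Without the check, calling the wrapper with a three-element vector would overwrite whatever happens to sit behind it on the heap, and the crash would show up later in an unrelated allocation.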
How to debug intermittent memory corruption errors
Look for uses of unsafe functions and check whether they follow the GC safety rules (see above).
If the problem persists, narrow it down to a specific ccall by selectively increasing memory pressure. This requires a comprehensive test suite.
If your test suite consists of several files, comment out all but one include and put a for loop around it to run the tests in that file several times. If this triggers the segfault, move to the next step.
Rinse and repeat for individual test sets/tests, i.e. run only a single test 10000 times.
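The repetition step could be sketched like this. The helper stress and its keyword arguments are made up for illustration, and the test inside the do block is a placeholder for a real test that goes through a ccall. Forcing a collection every few iterations is one possible way to increase memory pressure.

```julia
using Test

# Hypothetical helper: run `f` many times and force a garbage collection
# every `gc_every` iterations so that stale pointers get hit sooner.
function stress(f; iterations=10_000, gc_every=100)
    for i in 1:iterations
        f()
        i % gc_every == 0 && GC.gc()
    end
    return iterations
end

# Placeholder body: substitute the single test (or include) under suspicion.
runs = stress(iterations=1_000) do
    @test isapprox([1.0, 2.0], [1.0, 2.0])
end
```

As noted earlier in the thread, the crashes only showed up with --inline=yes and --compiled-modules=yes, so a loop like this should be started with the same flags.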