How to debug intermittent memory corruption errors (segfaults during GC) in large codebases

Hi folks,

I have an interesting (in the Chinese proverb sense) problem here.

While working on SPICE.jl I have been having problems with Julia crashing in the test suite but I had not been able to reproduce it consistently until now.

I have narrowed it down to two parameters:

  • The first one is confusingly the size of the code base. When there is a certain amount of code Julia will crash. What kind of code it is does not seem to matter. More confusingly the problem can also go away when adding more code until it reappears later when the size of the code is “right” again.
  • The inlining pass and/or precompilation seems to be to blame. I originally could not reproduce the crashes because I did not realize that they only occur with the exact flags that the Pkg.test() operation uses to spawn Julia, i.e. the crashes only occur with --inline=yes and --compiled-modules=yes.
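
To reproduce a crash like this outside of Pkg.test(), it may help to start the test script by hand with the same flags. A minimal sketch, assuming the entry point is test/runtests.jl (a hypothetical path, adjust it to your package) and that these two flags are the relevant ones, as observed above:

```julia
# Build a command that starts Julia with the two flags that were needed
# to trigger the crash. The script path is an assumption for illustration.
julia_exe = joinpath(Sys.BINDIR, Base.julia_exename())
test_script = joinpath("test", "runtests.jl")   # hypothetical path
cmd = `$julia_exe --inline=yes --compiled-modules=yes $test_script`

println(cmd)
# run(cmd)   # uncomment to actually start the test run
```

Pkg.test() may pass further flags on top of these two, so this is only a starting point for narrowing things down.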

The backtrace and the 30k lines of output produced by running the test suite are here :see_no_evil: https://gist.github.com/helgee/83658705479ac566f15abb3a7e1966a1

The code is on this branch: https://github.com/JuliaAstro/SPICE.jl/tree/segfault

Another “fun” fact: When I pipe the output of the crashing command to a file, Julia will not crash :man_shrugging:

Any ideas? I would like to narrow this a little more before opening an issue but I have no clue how to proceed.

Could it be this? “Don't use recursion in IncrementalCompact” by Keno (Pull Request #30885, JuliaLang/julia)

I got a new backtrace for y’all. This one is from master instead of 1.1 for the previous one.

signal (11): Segmentation fault: 11
in expression starting at /Users/helge/projects/julia/SPICE/test/d.jl:1
jl_gc_pool_alloc at /Users/helge/projects/julia/dev/src/gc.c:1112
jl_gc_alloc_ at /Users/helge/projects/julia/dev/src/./julia_internal.h:263
jl_gc_alloc at /Users/helge/projects/julia/dev/src/gc.c:2911
_new_array_ at /Users/helge/projects/julia/dev/src/array.c:100
_new_array at /Users/helge/projects/julia/dev/src/array.c:160
jl_alloc_array_1d at /Users/helge/projects/julia/dev/src/array.c:420
materialize at ./boot.jl:401 [inlined]
broadcast at ./broadcast.jl:751
- at ./arraymath.jl:39 [inlined]
#isapprox#22 at /Users/helge/projects/julia/dev/usr/share/julia/stdlib/v1.2/LinearAlgebra/src/generic.jl:1390
isapprox at /Users/helge/projects/julia/dev/usr/share/julia/stdlib/v1.2/LinearAlgebra/src/generic.jl:1390
unknown function (ip: 0x11b8da514)
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
eval_test at /Users/helge/projects/julia/dev/usr/share/julia/stdlib/v1.2/Test/src/Test.jl:240
unknown function (ip: 0x11b8bb483)
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
do_call at /Users/helge/projects/julia/dev/src/interpreter.c:323
eval_value at /Users/helge/projects/julia/dev/src/interpreter.c:411
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:635
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
jl_interpret_toplevel_thunk_callback at /Users/helge/projects/julia/dev/src/interpreter.c:884
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x11025988f)
unknown function (ip: 0x314)
jl_interpret_toplevel_thunk at /Users/helge/projects/julia/dev/src/interpreter.c:893
jl_toplevel_eval_flex at /Users/helge/projects/julia/dev/src/toplevel.c:797
jl_parse_eval_all at /Users/helge/projects/julia/dev/src/ast.c:873
jl_load at /Users/helge/projects/julia/dev/src/toplevel.c:859
jl_load_ at /Users/helge/projects/julia/dev/src/toplevel.c:866
include at ./boot.jl:325 [inlined]
include_relative at ./loading.jl:1041
include at ./Base.jl:29
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
include at ./client.jl:443
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
do_call at /Users/helge/projects/julia/dev/src/interpreter.c:323
eval_value at /Users/helge/projects/julia/dev/src/interpreter.c:411
eval_stmt_value at /Users/helge/projects/julia/dev/src/interpreter.c:362
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:754
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:699
jl_interpret_toplevel_thunk_callback at /Users/helge/projects/julia/dev/src/interpreter.c:884
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x1102de28f)
unknown function (ip: 0x2f6)
jl_interpret_toplevel_thunk at /Users/helge/projects/julia/dev/src/interpreter.c:893
jl_toplevel_eval_flex at /Users/helge/projects/julia/dev/src/toplevel.c:797
jl_parse_eval_all at /Users/helge/projects/julia/dev/src/ast.c:873
jl_load at /Users/helge/projects/julia/dev/src/toplevel.c:859
jl_load_ at /Users/helge/projects/julia/dev/src/toplevel.c:866
include at ./boot.jl:325 [inlined]
include_relative at ./loading.jl:1041
include at ./Base.jl:29
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
include at ./client.jl:443
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_fptr_trampoline at /Users/helge/projects/julia/dev/src/gf.c:1895
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
do_call at /Users/helge/projects/julia/dev/src/interpreter.c:323
eval_value at /Users/helge/projects/julia/dev/src/interpreter.c:411
eval_stmt_value at /Users/helge/projects/julia/dev/src/interpreter.c:362
eval_body at /Users/helge/projects/julia/dev/src/interpreter.c:754
jl_interpret_toplevel_thunk_callback at /Users/helge/projects/julia/dev/src/interpreter.c:884
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x10d90058f)
unknown function (ip: 0xffffffffffffffff)
jl_interpret_toplevel_thunk at /Users/helge/projects/julia/dev/src/interpreter.c:893
jl_toplevel_eval_flex at /Users/helge/projects/julia/dev/src/toplevel.c:797
jl_parse_eval_all at /Users/helge/projects/julia/dev/src/ast.c:873
jl_load at /Users/helge/projects/julia/dev/src/toplevel.c:859
jl_load_ at /Users/helge/projects/julia/dev/src/toplevel.c:866
include at ./boot.jl:325 [inlined]
include_relative at ./loading.jl:1041
include at ./Base.jl:29
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
exec_options at ./client.jl:307
_start at ./client.jl:476
jl_fptr_args at /Users/helge/projects/julia/dev/src/gf.c:1905
jl_apply_generic at /Users/helge/projects/julia/dev/src/gf.c:2250
jl_apply at /Users/helge/projects/julia/dev/usr/bin/julia-debug (unknown line)
true_main at /Users/helge/projects/julia/dev/usr/bin/julia-debug (unknown line)
main at /Users/helge/projects/julia/dev/usr/bin/julia-debug (unknown line)
Allocations: 17378241 (Pool: 17375138; Big: 3103); GC: 38

Current code is here: https://github.com/JuliaAstro/SPICE.jl/tree/segfault-1

Unfortunately, these backtraces are not very useful. In most cases, the reason for an intermittent crash in the GC is that you are not using the C interface/unsafe API correctly. You see crashes in the GC only because that’s where we detect it, and it starts from the compiler because that’s likely where you see the most small allocations once you’ve optimized your own code. Changing the usage pattern can easily make problems like this go away.

Oh boy :see_no_evil: That means I will need to manually inspect all of my 400+ ccalls?

Well, obviously only the ones you use and it in principle shouldn’t be much more work than writing all of them… (not saying that’s trivial work but not more than what you’ve done…).

There’s also rr if you are on Linux, so you only have to reproduce the crash once. Using it to debug this kind of issue does require some knowledge of Julia internals, which is a bit too much to cover in full here…

Just opened a random file to have a look and I can already see invalid code in there. The following line, for example, is invalid. The pointer may point to something completely random since the array you’ve created via copying can be freed before you convert it to a string.
https://github.com/JuliaAstro/SPICE.jl/blob/4b9f183fe0b45ad181aa463b4a348f09e8d66d17/src/l.jl#L406

Thanks for having a look!

Would you mind elaborating a bit on this? I do not get it. Why could the array be freed?

You can take a look at the usage of GC.@preserve in Base (e.g. https://github.com/JuliaLang/julia/blob/c670f1acdba1971b8545f1c3f3b0cfe55ee0d3f5/stdlib/LibGit2/src/tag.jl#L64-L68)

out[i] = unsafe_string(pointer(items[:,i]))

can be rewritten as

tmp_i = items[:, i]
ptr = pointer(tmp_i)
unsafe_string(ptr)

After the second line, there is nothing keeping tmp_i alive, so the GC might free it. The only things guaranteed to keep variables alive are, AFAIU, GC.@preserve or cconvert inside ccall.
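
To make those two safe patterns concrete, here is a minimal sketch. It uses strlen from the C standard library as a stand-in for a real C routine, and the function names c_strlen and buffer_to_string are made up for illustration:

```julia
# Pattern 1: pass the array itself to ccall. ccall converts it to a pointer
# (via cconvert/unsafe_convert) and keeps it rooted for the duration of the call.
function c_strlen(buf::Vector{UInt8})
    return Int(ccall(:strlen, Csize_t, (Ptr{UInt8},), buf))
end

# Pattern 2: when a raw pointer has to exist in Julia code, keep its owner
# alive explicitly with GC.@preserve while the pointer is in use.
function buffer_to_string(buf::Vector{UInt8})
    str = GC.@preserve buf unsafe_string(pointer(buf))
    return str
end

buf = Vector{UInt8}("hello\0")   # NUL-terminated, like a C string
println(c_strlen(buf))
println(buffer_to_string(buf))
```

In both cases the array is bound to a name that stays reachable while the pointer is used, which is exactly what the temporary in pointer(items[:,i]) lacks.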

Julia got a lot cleverer about GC in 1.0 which has the positive side effect of making the language much more efficient, but also has the side effect of making a lot of slightly sketchy C interop code like this much more likely to actually cause a crash than before.

Thanks a lot, folks!

So the proper way to write the function linked above would be this?

function lparsm(list, delims, nmax, lenout)
    n = Ref{SpiceInt}()
    items = Array{UInt8}(undef, lenout, nmax)
    GC.@preserve items begin
        ccall((:lparsm_c, libcspice), Cvoid, (Cstring, Cstring, SpiceInt, SpiceInt, Ref{SpiceInt}, Ptr{UInt8}),
               list, delims, nmax, lenout, n , items)
        handleerror()
        out = String[unsafe_string(pointer(items[:,i])) for i in 1:n[]]
    end
    out
end

EDIT: Actually the correct way would be split(list). This specific function does not need to be wrapped at all :stuck_out_tongue_winking_eye:

Pretty sure no,

pointer(items[:,i])

is still wrong because items[:,i] creates a new array that you take the pointer to, but this new array is not protected from GC. You need something like

out = String[]
for i in 1:n[]
    items_i = items[:,i]
    GC.@preserve items_i begin
        push!(out, unsafe_string(pointer(items_i)))
    end
end

I guess you are fine only protecting items as long as you use a view? Also, there is no need for the @preserve also wrapping the ccall, the conversion from items to Ptr{UInt8} is fine because as I said

Now I get it! The fact that indexing allocates a new array was the missing piece :+1:
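
A small sketch of that missing piece, using a made-up 4×2 byte matrix: slicing with items[:, i] copies the column, while view(items, :, i) refers to the memory of the parent array.

```julia
items = zeros(UInt8, 4, 2)

col_copy = items[:, 2]        # allocates a brand-new Vector{UInt8}
col_copy[1] = 0xff
println(items[1, 2])          # the parent is unchanged by the write above

col_view = view(items, :, 2)  # no copy; refers to the memory of `items`
col_view[1] = 0xff
println(items[1, 2])          # the parent sees this write

# The copy lives at a different address than column 2 of `items`, so
# preserving `items` does nothing to keep `col_copy` alive.
different_address = GC.@preserve items col_copy begin
    pointer(col_copy) != pointer(items, 5)   # linear index 5 is items[1, 2]
end
println(different_address)
```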

function chararray_to_strings(array, nmax=-1)
    strings = String[]
    m, n = size(array)
    n = nmax == -1 ? n : nmax
    n == 0 && return strings
    GC.@preserve array begin
        for i = 1:n
            ptr = pointer(array, (i - 1) * m + 1)
            push!(strings, unsafe_string(ptr))
        end
    end
    strings
end

Why do you need unsafe_string at all here?

I don’t, awesome!

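function chararray_to_strings(array, nmax=-1)
    strings = String[]
    m, n = size(array)
    n = nmax == -1 ? n : nmax
    n == 0 && return strings
    for i = 1:n
        line = array[:, i]
        # Index of the last byte before the NUL terminator; if no NUL is
        # found (findfirst returns nothing), use the whole column.
        idx = something(findfirst(iszero, line), m + 1) - 1
        push!(strings, String(line[1:idx]))
    end
    strings
end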

Again, thanks for your help so far. I applied all the tips above but still see segfaults. The remaining use of raw pointers in the code base is this: SPICE.jl/cells.jl at main · JuliaAstro/SPICE.jl · GitHub

Is this pattern safe? Do I need an explicit finalizer?

I think I found the culprit: One of the C functions I called expected an array of a certain length but the magic number was hidden deep in the documentation :man_shrugging:
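
A defensive wrapper can catch this class of bug before the C side writes past the end of the buffer. A minimal sketch: REQUIRED_LEN and fill_buffer! are hypothetical names, and memset from the C standard library stands in for a C routine that always writes a fixed number of bytes.

```julia
# Hypothetical minimum size demanded by the C routine's documentation.
const REQUIRED_LEN = 16

function fill_buffer!(buf::Vector{UInt8})
    # Validate on the Julia side: C will not check the length for us.
    length(buf) >= REQUIRED_LEN ||
        throw(ArgumentError("buffer needs at least $REQUIRED_LEN bytes, got $(length(buf))"))
    # Stand-in for the real call: writes exactly REQUIRED_LEN bytes into buf.
    ccall(:memset, Ptr{Cvoid}, (Ptr{UInt8}, Cint, Csize_t), buf, 0x41, REQUIRED_LEN)
    return buf
end

ok = fill_buffer!(zeros(UInt8, REQUIRED_LEN))
println(String(copy(ok)))
```

Keeping the magic number in one named constant next to the wrapper also makes it much easier to spot than a bare literal at each call site.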

I will leave some hints here on how to debug this kind of problem and rename the thread to make it easier to find.

How to debug intermittent memory corruption errors

  1. Look for uses of unsafe functions and check whether they follow the GC safety rules (see above).
  2. If the problem persists, narrow it down to a specific ccall by selectively increasing memory pressure. This requires a comprehensive test suite.
    1. If your test suite consists of several files, comment out all but one include and put a for loop around it to run the tests in that file several times. If this triggers the segfault, move to the next step.
    2. Rinse and repeat for individual test sets/tests, i.e. run only a single test 10000 times.
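
The looping in steps 1 and 2 can be sketched as below. run_repeatedly is a hypothetical helper, and the periodic GC.gc() call goes beyond the steps above, on the assumption that forcing collections makes a dangling pointer surface sooner.

```julia
using Test

# Run `f` many times, forcing a full collection every `gc_every` iterations.
function run_repeatedly(f; iterations=10_000, gc_every=1_000)
    for i in 1:iterations
        f()
        i % gc_every == 0 && GC.gc()
    end
    return iterations
end

# Stand-in for a single suspicious test; replace the body with the real one,
# or with include("some_test_file.jl") to loop over a whole file.
calls = Ref(0)
run_repeatedly(iterations=100, gc_every=25) do
    calls[] += 1
    @test isapprox([1.0, 2.0] .- [1.0, 2.0], [0.0, 0.0])
end
println(calls[])
```

If the loop segfaults for one file or test but not for the others, the ccalls exercised by that test are the first place to look.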