GC error (probable corruption)

I’m struggling to reproduce a GC-related error on Julia v0.7 and v1.0 on Windows x86. (Notably, it does not occur on Julia v0.6.)

The tests for GLPK.jl fail with the following error. (Here is the AppVeyor log.)

Allocations: 2388187 (Pool: 2388000; Big: 187); GC: 12
GC error (probable corruption) :
!!! ERROR in jl_ -- ABORTING !!!
1474F020: Queued root: 0D2D6AD0 :: 0D000050 (bits: 3)
        of type Core.TypeName
1474F02C: Queued root: 0DD30E80 :: 0D000050 (bits: 3)
        of type Core.TypeName
... many more lines like this ...

We’re using the new Appveyor.jl script. For some reason, adding the following line before - echo "%JL_TEST_SCRIPT%" results in the tests passing without deprecation warnings.
- C:\julia\bin\julia -e "VERSION >= v\"0.7-\" && (using Pkg; Pkg.add(\"LinQuadOptInterface\"); using LinQuadOptInterface)"
(LinQuadOptInterface is a dependency package.)

I’ve tried various things to reproduce this locally without success.

Any ideas what this is, where it might come from, or how to go about fixing?

:smiling_face_with_tear: I ran into the same issue today.

julia> const mst = build_2ssp!(sub, t, T, tks);
GC error (probable corruption)
Allocations: 101110938 (Pool: 101110358; Big: 580); GC: 50
<?#0x7b288a390990::(nil)>

[2908054] signal 6 (-6): Aborted
in expression starting at REPL[46]:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
gc_dump_queue_and_abort at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/gc-stock.c:1664
gc_mark_outrefs at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/gc-stock.c:2365 [inlined]
gc_mark_and_steal at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/gc-stock.c:2567
gc_mark_loop_parallel at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/gc-stock.c:2722 [inlined]
jl_parallel_gc_threadfun at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/gc-stock.c:3644
unknown function (ip: 0x7b348ee9caa3) at /lib/x86_64-linux-gnu/libc.so.6
unknown function (ip: 0x7b348ef29c6b) at /lib/x86_64-linux-gnu/libc.so.6
Allocations: 101110938 (Pool: 101110358; Big: 580); GC: 50
./j: line 1: 2908054 Aborted                 (core dumped) julia --project=. --threads=255,1

The behavior is like this:

If I start a fresh Julia REPL and run my code, I don’t encounter the error. But if I reuse a REPL that has already run my code, and I paste the same code into it a second time, it almost surely triggers this error.

My code is structured like this:

include("xx.jl")
import ...
function f()
...
end
f()

If you copy-paste that block into the REPL a second time, the error occurs.

By the way, a healthy execution looks like this (measured with @time):

julia> const mst = build_2ssp!(sub, t, T, S, tks);
1728.963785 seconds (132.69 G allocations: 3.919 TiB, 85.34% gc time, 57.22% compilation time: <1% of which was recompilation)

I put my source code in my repo (the full code is there, as shown in the links below) so anyone can read it. If you have Gurobi, you can run it.

This line triggers the error almost surely.

The most suspicious related code is at

But as far as I can tell from experience, I didn’t do anything wrong.

I use multithreading with @spawn on my Linux machine with 256 virtual processors, and I make heavy use of Gurobi’s C API.

The most likely cause here is a mistake in your ccalls (e.g. you forgot a GC.@preserve or something).
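i.e., something like this (a minimal sketch with a hypothetical C function, not the actual Gurobi call):

buf = zeros(Cdouble, 10)
p = pointer(buf)
# If nothing roots `buf`, the GC may free or reuse it while C still holds `p`.
# GC.@preserve keeps `buf` alive for the duration of the call:
GC.@preserve buf ccall(:some_c_function, Cvoid, (Ptr{Cdouble}, Cint), p, 10)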

Yeah, I believe this was an issue in GLPK when we recovered from an interrupt; it left the internal model in an unsafe state. Long since fixed.

I make some use of Gurobi’s C API, but in the form of the Julia function wrappers shipped by Gurobi.jl (the ccalls are defined therein). And on my first fresh run, everything seems normal.

@odow my code is a Benders decomposition on a 2SSP problem. On my first fresh run the algorithm behaves normally:

julia> for k=1:20
           multi_solve(tks, sub, mst)
           proceed, vioMean, ub = bwd_in_sequential(sub, mst, common)
           lb, common = get_trial!(mst)
           agap = ub - lb
           rgap = agap/ub
           println("lb = $lb, agap = $agap, vioMean=$vioMean, rgap = $rgap")
       end
lb = 5800.542489373332, agap = 13016.850062323694, vioMean=13016.850062323696, rgap = 0.6917456829665694
lb = 5800.542489373333, agap = 287.49409274634036, vioMean=287.4940927463401, rgap = 0.04722279323857864
lb = 5800.542489373333, agap = 90.50753003175123, vioMean=90.50753003175153, rgap = 0.015363565023827663
lb = 5800.542489373334, agap = 69.28624517584831, vioMean=69.28624517584983, rgap = 0.011803793314792456
lb = 5800.542489373334, agap = 41.59680711598503, vioMean=41.59680711598685, rgap = 0.007120132712511929
lb = 5801.660608885939, agap = 86.74651366101898, vioMean=87.86463317362389, rgap = 0.014731745250572593
lb = 5803.019760603801, agap = 44.489640759772556, vioMean=45.84879247763274, rgap = 0.00760830598226966
lb = 5806.637099984512, agap = 18.761448730376287, vioMean=22.378788111087488, rgap = 0.0032206292107713652
lb = 5807.112082577849, agap = 10.513653992418767, vioMean=10.988636585756467, rgap = 0.0018072070065162018
lb = 5809.057606162431, agap = 17.796139043714902, vioMean=19.74166262829719, rgap = 0.0030541592121401877
lb = 5809.358171007881, agap = 6.380721145800635, vioMean=6.681285991249145, rgap = 0.0010971471147731891
lb = 5809.73359846487, agap = 4.264219228016373, vioMean=4.6396466850067855, rgap = 0.000733440114999648
lb = 5810.111134613399, agap = 2.222963515127958, vioMean=2.6004996636587143, rgap = 0.0003824562521008065
lb = 5810.548465937093, agap = 0.8144561931358112, vioMean=1.2517875168329435, rgap = 0.00014014891240646555
lb = 5810.637842505216, agap = 2.186196595058391, vioMean=2.275573163182647, rgap = 0.0003760988773017765
lb = 5810.731899858052, agap = 0.30446845109236165, vioMean=0.3985258039310793, rgap = 5.23948624298412e-5
lb = 5810.794026566477, agap = 1.153930180745192, vioMean=1.216056889169522, rgap = 0.0001985444792921052
lb = 5810.805877360574, agap = 0.4342613856706521, vioMean=0.4461121797700495, rgap = 7.472783352648415e-5
lb = 5810.813777252803, agap = 0.10374402782144898, vioMean=0.1116439200516955, rgap = 1.7853295532335417e-5
lb = 5810.817069648572, agap = 0.09195715511395974, vioMean=0.0952495508836364, rgap = 1.5824917356267948e-5

So I think the ccalls are working properly.

But when I paste my code into the same REPL for the second time, it just crashes…
Here is another crash:

julia> const mst = build_2ssp!(sub, t, T, tks);
 here, before >>

[2927051] signal 11 (128): Segmentation fault
in expression starting at REPL[30]:1
jl_svecref at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/julia.h:1309 [inlined]
jl_is_va_tuple at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/julia_internal.h:1169 [inlined]
obviously_unequal at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/subtype.c:466
ijl_types_equal at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/subtype.c:2325
jl_smallintset_lookup at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/smallintset.c:137
jl_specializations_get_linfo_ at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/gf.c:183
cache_method at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/gf.c:1587
jl_mt_assoc_by_type at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/gf.c:1869
jl_lookup_generic_ at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/gf.c:4176 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/gf.c:4210
#mst!#126 at /home/amd/julia_projects/uc/Uc/src/General.jl:337
mst! at /home/amd/julia_projects/uc/Uc/src/General.jl:205
unknown function (ip: 0x74082d5de7b4) at (unknown file)
jl_apply at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/julia.h:2391 [inlined]
jl_f__apply_iterate at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/builtins.c:868
model! at /home/amd/julia_projects/uc/Uc/src/General.jl:12
macro expansion at ./timing.jl:697 [inlined]
build_2ssp! at ./REPL[9]:10
unknown function (ip: 0x74082d55a854) at (unknown file)
jl_apply at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/julia.h:2391 [inlined]
do_call at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/interpreter.c:123
eval_value at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/interpreter.c:243
eval_stmt_value at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/interpreter.c:194 [inlined]
eval_body at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/interpreter.c:707
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/interpreter.c:898
jl_toplevel_eval_flex at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/toplevel.c:1035
__repl_entry_eval_expanded_with_loc at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:301
jl_apply at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/julia.h:2391 [inlined]
jl_f_invokelatest at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/builtins.c:881
toplevel_eval_with_hooks at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:308
toplevel_eval_with_hooks at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:312
toplevel_eval_with_hooks at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:312
toplevel_eval_with_hooks at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:305 [inlined]
eval_user_input at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:330
repl_backend_loop at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:452
#start_repl_backend#41 at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:427
start_repl_backend at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:424 [inlined]
#run_repl#50 at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:653
run_repl at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/usr/share/julia/stdlib/v1.12/REPL/src/REPL.jl:639
jfptr_run_repl_19665.1 at /home/amd/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/share/julia/compiled/v1.12/REPL/u0gqU_E4m7X.so (unknown line)
run_std_repl at ./client.jl:478
jfptr_run_std_repl_24985.1 at /home/amd/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/julia.h:2391 [inlined]
jl_f_invokelatest at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/builtins.c:881
run_main_repl at ./client.jl:499
repl_main at ./client.jl:586 [inlined]
_start at ./client.jl:561
jfptr__start_63319.1 at /home/amd/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/julia.h:2391 [inlined]
true_main at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/jlapi.c:971
jl_repl_entrypoint at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/src/jlapi.c:1139
main at /cache/build/builder-amdci5-4/julialang/julia-release-1-dot-12/cli/loader_exe.c:58
unknown function (ip: 0x7411b082a1c9) at /lib/x86_64-linux-gnu/libc.so.6
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8) at /workspace/srcdir/glibc-2.17/csu/../sysdeps/x86_64/start.S
Allocations: 93899468 (Pool: 93898934; Big: 534); GC: 46
./j: line 1: 2927051 Segmentation fault      (core dumped) julia --project=. --threads=255,1
usr@usr:~/julia_projects/uc/Uc$ 

I have no idea where it goes wrong. You can take a look at my code if you have free time. Thanks.
The fortunate thing is that it currently won’t crash on a fresh first run, so I can keep developing my algorithm.

mst! has notably many @inbounds annotations, which could be causing corruption somewhere. Check this in a process started with the flag --check-bounds=yes. If the second paste’s crash is as reliable as you say, just a few runs should be informative.
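To illustrate what the flag changes (a toy sketch, not your code): under --check-bounds=yes every @inbounds annotation is ignored, so an out-of-range index throws a BoundsError instead of silently reading or writing out of bounds.

function sum_first_n(v, n)
    s = 0.0
    @inbounds for i in 1:n   # unsafe whenever n > length(v)
        s += v[i]
    end
    return s
end

sum_first_n(zeros(3), 5)   # BoundsError under --check-bounds=yes; undefined behavior otherwise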

Earlier you mentioned multithreading; if you’re concerned about a race condition, you could also check in a process started with the flag --threads=1. I wonder whether the runtime would stop you from reaching a crash, though; it takes 256 threads almost half an hour to complete.

Thanks. The error still exists when I start Julia with julia --project=. --threads=1,0 --check-bounds=yes. So it’s not likely due to the use of @inbounds.

But under this configuration, the error message is too voluminous to capture (my VS Code has a maximum scrollback of 10000 lines, which cannot contain the full message). It just floods my screen rapidly.

That runtime came from a large-scale test; in my small-scale script (S=3), it takes less than 20 seconds.


Overall I think this is a serious issue worth looking into carefully. Just by running a single line of code in the Julia REPL, it can completely crash my bash terminal; the error message is printed rapidly and endlessly. The exact line that errors is definitely

mst = General.model!(sub, t, T, tks, Line.P, F, WD, LD, GD, uH, xH, pAH)

(It works normally on a first run, but crashes consistently on the second run if you re-paste the whole script beforehand.)

That line builds some new JuMP models; I cannot figure out what is wrong.

As clarified above, the issue persists under --threads=1,0 --check-bounds=yes. So it’s probably not caused by a “race condition”.

You can use redirect_stderr or redirect_stdio to write it to a file instead. If the identifying segfault or GC error is being covered up by all those lines, it could be worth checking if it’s the same corruption or some other error.
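For example (a minimal sketch; redirect_stdio accepts a file path since Julia 1.7, and whatever triggers the crash goes inside the do block):

redirect_stdio(stderr = "crash.log") do
    include("xx.jl")   # the code that triggers the crash
end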

In one case I got a log file of 300+ MB. In other cases the output is endless…
And in some cases the output is short, like the one in post #2.

The error output is not deterministic: sometimes a segfault, sometimes a GC error. But it’s reproducible.

If I skim the log I can sometimes see “matlab_matrix” or similar (I use MATPOWER files as the input data of my optimization problem). I can also see some MathOptInterface and Gurobi frames, and a lot of raw stuff like Memory… It’s not at all clear what the real culprit is. (I tend to believe the real cause is some ccall from Gurobi.jl into the Gurobi solver (a related link is GitHub - jump-dev/Gurobi.jl: A Julia interface to the Gurobi Optimizer · GitHub). Other places like PowerModels.jl are less suspicious. But I don’t know whether I wrote some wrong code somewhere, or whether the error comes from outside my user code.)

This is almost certainly a bug in your code, but there are too many places where things could go wrong: threading, @inbounds, ccalls… add to that that Gurobi is not thread safe.

It’s another reminder that the small performance win is not worth the time it takes to write and debug “tricky” code.

Gurobi is not thread safe.

The method Model!(v, t) in my Settings.jl module can create independent models in independent envs.
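Something like this (an illustrative equivalent of that pattern, not the literal Settings.jl code):

import JuMP, Gurobi

function independent_model()
    env = Gurobi.Env()   # a dedicated environment, never shared across models
    return JuMP.direct_model(Gurobi.Optimizer(env))
end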

And I’ve actually reproduced the error with julia --threads=1,0 --check-bounds=yes.

Additionally, multithreading works well; I’ve worked with it for some time. I think in an application like the one I’m working on (stochastic unit commitment), the modeling part can really be a slow bottleneck.

e.g.

julia> const mst = build_2ssp!(sub, t, T, S, tks);
1728.963785 seconds (132.69 G allocations: 3.919 TiB, 85.34% gc time, 57.22% compilation time: <1% of which was recompilation)

I think 1728 seconds of wall time is not too bad. But the compilation time, GC time, and allocations are… hard to be satisfied with…

(But I still opt to use JuMP, because otherwise the physical constraints are too hard to code.)

Thanks, friends.

It appears to be an oversight of mine.

This line

Gurobi.GRBgetdblattrarray(mst.o, "X", 1, mst.xllen, mst.Θ)

was a mistake, which should be

Gurobi.GRBgetdblattrarray(mst.o, "X", 1, length(mst.Θ), mst.Θ)

:smiling_face_with_tear:

Basically, it’s something like this:
I pass a = [1.0, 2.0, 3.0] to Gurobi via a C API, and Gurobi then attempts to write a[4] = 4.0.
Later, when Julia attempts to run GC, that out-of-bounds write triggers the error. (I surmise.)
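The same class of bug can be sketched without Gurobi at all (this deliberately corrupts memory via libc’s memset, so run it only in a throwaway session):

a = [1.0, 2.0, 3.0]
# Claim the buffer holds 4 elements when it only holds 3:
GC.@preserve a ccall(:memset, Ptr{Cvoid}, (Ptr{Cvoid}, Cint, Csize_t),
                     pointer(a), 0, 4 * sizeof(Float64))
GC.gc()   # the clobbered word past `a` may now abort with a GC error or segfault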


To reproduce:

julia> include("src/Settings.jl")
Main.Settings

julia> import Gurobi, JuMP

julia> m = Settings.Model()
A JuMP Model
├ mode: DIRECT
├ solver: Gurobi
├ objective_sense: FEASIBILITY_SENSE
├ num_variables: 0
├ num_constraints: 0
└ Names registered in the model: none

julia> JuMP.@variable(m, x[1:6]);

julia> m = (m = m, o = m.moi_backend, refi = Ref{Cint}(), X = fill(NaN, 3));

julia> Settings.opt_and_ter(m)==2 || error()
true

julia> if 0 == Gurobi.GRBgetdblattrarray(m.o, "X", 0, 4, m.X)
           @warn "This is a false no-error code"
       end
┌ Warning: This is a false no-error code
└ @ Main REPL[7]:2

julia> G
[2994385] signal 11 (1): Segmentation fault...

Just to check: this means it should still be fine if each thread handles its own model separately from the others?

Ah, --check-bounds only controls whether Julia’s own @boundscheck expressions are elided or kept; it can’t add bounds checking to code that never had it, like a C library writing into Julia-owned memory. I wonder if ASan could even catch bad C writes to Julia allocations.

In my experience, JuMP.jl and the Gurobi solver both support multithreading well. But it makes less sense to talk about “threads” here; my workflow is task-based, for instance.

Let’s say we have S scenarios, and each scenario requires a JuMP.Model.
My hardware gives me only 256 virtual processors.
If S ≤ 256, it’s clear: we just build S independent models and spawn S tasks, each task manipulating one model. In this case it’s fully parallelized, since the hardware has at least as many processors as tasks.

However, if S becomes larger, say 10000, then we have two options:

  • Create 256 JuMP models
  • Create 10000 JuMP models

The advantage of the first option is that the allocated memory has an upper bound, and it respects the capabilities of my hardware.
The advantage of the second option is that the models are genuinely independent, each corresponding to a specific, deterministic scenario. (A sketch of the first option follows.)
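A sketch of the first option (build_model and solve_scenario! are placeholder names for my JuMP/Gurobi code, each worker task owning one model):

using Base.Threads: @spawn, nthreads

function solve_all(S; W = nthreads())
    models = [build_model() for _ in 1:W]   # a fixed pool of W models
    results = Vector{Any}(undef, S)
    tasks = map(1:W) do w
        @spawn for s in w:W:S               # worker w takes scenarios w, w+W, ...
            results[s] = solve_scenario!(models[w], s)   # model never shared between tasks
        end
    end
    foreach(wait, tasks)
    return results
end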

I think the main bottleneck currently is that, in a decomposition-based optimization workflow, building a Vector of JuMP models in parallel is slow (e.g. it uses dynamic dispatch and allocates a lot, I guess). In contrast, LP/MIPs can be solved by Gurobi almost perfectly in parallel (the CPU can be 100% engaged). But that’s a different topic I want to discuss elsewhere with Oscar…

For instance, this is the result of my code execution measured with @time, at S = 256*20:

1096.368006 seconds (70.98 G allocations: 2.090 TiB, 79.74% gc time, 106.36% compilation time: <1% of which was recompilation)
710.525029 seconds (32.53 k allocations: 2.515 MiB, 0.01% compilation time)
  4.671231 seconds (190.43 k allocations: 1.250 GiB, 0.51% compilation time)
1673.125459 seconds (226.75 k allocations: 20.260 MiB, 0.04% compilation time)

, where the first line is the JuMP modeling (building S models in parallel), and the remaining three lines are my own algorithm solving the optimization models with the Gurobi solver. (Watching htop on my Ubuntu system, memory usage is 443 GB after termination.) It’s still true that solve time > modeling time, but there seems to be room for improvement in JuMP (am I right?).