Avoid LLVM setjmp bug

jameson · December 24, 2016, 7:43pm

Currently, we can’t reliably run LLVM optimizations on any function that contains a try/catch, because LLVM may mis-optimize it. This is captured in the following issue:

github.com/JuliaLang/julia

LLVM mis-optimize due to returntwice function

opened 09:41PM - 05 Jul 16 UTC

maleadt

bug upstream codegen llvm correctness bug ⚠

One of my packages (CUDAdrv) has recently started failing on julia master, with …a segfault in `typemap.c`. I've bisected this issue to e2bd1298732f36465fbd5d112d959fe1de052c7c (all backtraces and line numbers below are on that commit's tree). I'm not sure where to start debugging this, so I'm at least reporting it here already.

This causes a segfault in `jl_typemap_level_assoc_exact`:

```
signal (11): Segmentation fault
while loading CUDAdrv/test/core.jl, in expression starting on line 172
sig_match_fast at src/gf.c:1707
jl_apply_generic at src/gf.c:1886
Type at CUDAdrv/src/module.jl:67
unknown function (ip: 0x7f6198ad561e)
jl_call_method_internal at src/julia_internal.h:92
jl_apply_generic at src/gf.c:1931
do_call at src/interpreter.c:65
eval at src/interpreter.c:188
eval_body at src/interpreter.c:469
eval_body at src/interpreter.c:515
jl_interpret_call at src/interpreter.c:573
jl_interpret_toplevel_thunk at src/interpreter.c:580
jl_toplevel_eval_flex at src/toplevel.c:543
jl_parse_eval_all at src/ast.c:700
jl_load at src/toplevel.c:566
jl_load_ at src/toplevel.c:575
include_from_node1 at ./loading.jl:426
unknown function (ip: 0x7f639ef3716c)
jl_call_method_internal at src/julia_internal.h:92
jl_apply_generic at src/gf.c:1931
do_call at src/interpreter.c:65
eval at src/interpreter.c:188
jl_interpret_toplevel_expr at src/interpreter.c:31
jl_toplevel_eval_flex at src/toplevel.c:529
jl_parse_eval_all at src/ast.c:700
jl_load at src/toplevel.c:566
jl_load_ at src/toplevel.c:575
include_from_node1 at ./loading.jl:426
unknown function (ip: 0x7f639ef3716c)
jl_call_method_internal at src/julia_internal.h:92
jl_apply_generic at src/gf.c:1931
process_options at ./client.jl:266
_start at ./client.jl:322
unknown function (ip: 0x7f639ef75124)
jl_call_method_internal at src/julia_internal.h:92
jl_apply_generic at src/gf.c:1931
jl_apply at ui/../src/julia.h:1396
true_main at ui/repl.c:546
main at ui/repl.c:674
unknown function (ip: 0x7f63a5013740)
unknown function (ip: 0x401818)
Allocations: 1748036 (Pool: 1746766; Other: 1270); GC: 4
Allocations: 1748036 (Pool: 1746766; Other: 1270); GC: 4
```

Running in GDB makes it segfault somewhere else, but I assume due to the same problem (`jl_typeof(NULL)`):

```
Thread 1 "julia" received signal SIGSEGV, Segmentation fault.
0x00007ffff76c9407 in jl_typemap_level_assoc_exact (cache=0x7ffdf1802950, args=0x7fffffffae90, n=3, offs=1 '\001') at src/typemap.c:788
788         jl_value_t *ty = (jl_value_t*)jl_typeof(a1);
(gdb) l
783
784     jl_typemap_entry_t *jl_typemap_level_assoc_exact(jl_typemap_level_t *cache, jl_value_t **args, size_t n, int8_t offs)
785     {
786         if (n > offs) {
787             jl_value_t *a1 = args[offs];
788             jl_value_t *ty = (jl_value_t*)jl_typeof(a1);
789             assert(jl_is_datatype(ty));
790             if (ty == (jl_value_t*)jl_datatype_type && cache->targ != (void*)jl_nothing) {
791                 union jl_typemap_t ml_or_cache = mtcache_hash_lookup(cache->targ, a1, 1, offs);
792                 jl_typemap_entry_t *ml = jl_typemap_assoc_exact(ml_or_cache, args, n, offs+1);
(gdb) call jl_(args[0])
Base.#==()
(gdb) p args[1]
$3 = (jl_value_t *) 0x0
(gdb) call jl_(args[2])
CUDAdrv.CuError(code=209, info=Base.Nullable{String}(isnull=true, value=#<null>))
```

... with this comparison (against 209 == `CUDAdrv.ERROR_NO_BINARY_FOR_GPU`) originating from:

```julia
try
    @apicall(:cuModuleLoadDataEx,
             (Ptr{CuModule_t}, Ptr{Cchar}, Cuint, Ref{CUjit_option}, Ref{Ptr{Void}}),
             handle_ref, data, length(optionKeys), optionKeys, optionValues)
catch err
    (err == ERROR_NO_BINARY_FOR_GPU || err == ERROR_INVALID_IMAGE) || rethrow(err)
    options = decode(optionKeys, optionValues)
    rethrow(CuError(err.code, options[ERROR_LOG_BUFFER]))
end
```

I've not been able to reduce the test case, as reduced versions did not reliably trigger the segfault on all my systems anymore, while the full CUDAdrv test suite does. I've tested on two Linux64 systems (one Debian 8, one Arch), with fresh builds without any Makefile flags.
@yuyichao any ideas what might be causing this, or where to look for clues?

Codegen bugs are always really nasty, since they can be so unpredictable. Today I realized we may be able to change our codegen representation to avoid this case with minimal effort! If codegen always out-lined the body of try/catch code into a separate function, LLVM should no longer be able to generate bad code for it. I think this may even let it perform better optimizations than it currently does. The key here is that the return path from setjmp needs to avoid referencing any local state.

This would mean codegen would take a Julia function of the form:

function f()
  setup
  try
    try-body
  catch
    catch-body
  end
  rest-of-function
end

And emit something of the form (in pseudo-C syntax):

jl_value_t *f(args, …) { /* the actual function */
  alloca /* local state */
  <setup> /* user code */
  switch (f_trycatch(&alloca)) {
  case 0: /* normal fall-through */
    break;
  case 1: /* user code */
    <catch-body>
    break;
  default:
    abort(); /* corrupted codegen */
  }
  <rest-of-function> /* user code */
}
int f_trycatch(struct* alloca) {
  /* return value describes control flow continuation path */
  /* all other local state (including a gc-frame) */
  /* is packaged into the alloca struct that is passed as a pointer argument */
  int control_flow = setjmp(alloca->jmpbuf); /* Expr(:enter) */
  if (control_flow)
    return control_flow;
  <try-body> /* user code */
  return 0; /* Expr(:leave) */
}
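
To make the shape above concrete, here is a minimal plain-C sketch of the same out-lining idea. It is not what Julia's codegen emits: `try_frame`, `throw_error`, `guarded_body`, and `run_guarded` are hypothetical names invented for this illustration, and the "try body" is just a doubling with an artificial failure for negative inputs. The point it tries to show is that the non-zero return path from setjmp touches no locals; everything the parent needs afterwards lives in the struct passed by pointer.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the "alloca" struct: all state shared between
   the parent function and the out-lined try body lives here. */
struct try_frame {
    jmp_buf jmpbuf; /* handler installed on entry, like Expr(:enter) */
    int input;
    int result;
};

/* Hypothetical "throw": unwind to the handler, selecting path 1. */
static void throw_error(struct try_frame *frame)
{
    longjmp(frame->jmpbuf, 1);
}

/* Out-lined try body. The return value names the continuation path.
   The non-zero setjmp path returns a constant and reads no locals. */
static int guarded_body(struct try_frame *frame)
{
    assert(frame != NULL);
    if (setjmp(frame->jmpbuf) != 0)
        return 1;                     /* go to the catch path */
    frame->result = frame->input * 2; /* <try-body> */
    if (frame->input < 0)
        throw_error(frame);           /* artificial failure */
    return 0;                         /* like Expr(:leave) */
}

/* The "actual function": dispatches on the continuation path. */
int run_guarded(int input)
{
    struct try_frame frame;
    frame.input = input;
    frame.result = 0;
    switch (guarded_body(&frame)) {
    case 0:                /* normal fall-through */
        break;
    case 1:                /* <catch-body> */
        frame.result = -1;
        break;
    default:
        abort();           /* corrupted control flow */
    }
    return frame.result;   /* <rest-of-function> */
}
```

Under these assumptions, `run_guarded(3)` takes the normal path and returns 6, while `run_guarded(-1)` goes through the longjmp and returns -1 from the catch path.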

Since there are already a couple of function calls on this code path, I don’t think the addition of the extra local jmp statement will impact performance, while the removal of the volatile loads / stores may actually permit improved codegen optimizations (so, a net benefit).
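
For contrast, this is roughly what the inline form looks like at the C level, and why it needs volatile. The sketch only illustrates the general C rule (a non-volatile local modified between setjmp and longjmp has an indeterminate value after the jump); it is not a reproduction of the LLVM miscompile in the issue, and `handler`, `raise_error`, and `progress_before_throw` are hypothetical names.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf handler;

/* Hypothetical "throw": unwind to the handler. */
static void raise_error(void)
{
    longjmp(handler, 1);
}

/* Inline try/catch: the catch path reads a local that the try body wrote.
   The local must be volatile, otherwise its value after longjmp is
   indeterminate and the optimizer may keep a stale copy in a register. */
int progress_before_throw(void)
{
    volatile int progress = 0;
    if (setjmp(handler) != 0)
        return progress; /* catch path references local state */
    progress = 1;        /* try body modifies the local */
    raise_error();
    return -1;           /* not reached */
}
```

With the volatile qualifier, `progress_before_throw()` returns 1. The out-lined form avoids needing volatile at all, because the setjmp return path never reads such a local.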

I think this should work, but having a second person review this concept is always good, so I’m posting here for review and other suggestions.
