Plea for segfault assistance in PythonCall.jl

Hi all,

I was wondering if someone could help us track down a segfault in Julia/PythonCall. It has been hitting us for the past half year in GitHub Actions and has now started to hurt our library’s user base. It has been disruptive for all tools operating at the Python<->Julia interface, and our PySR/SymbolicRegression team has not been able to figure it out despite our best efforts.

The main JuliaLang issue is “Segfault in `jl_object_id__cold` on Julia 1.11” (JuliaLang/julia issue #58171). This seems to occur most readily on Python 3.13. It can occur on both Julia 1.11 and 1.10. The reproducer boils down to creating a large number of Julia objects from Python; at some point while doing so, the Julia GC segfaults:

export PYTHON_JULIACALL_HANDLE_SIGNALS=yes
 
python -c '
from juliacall import Main as jl
x = [jl.randn(5) for _ in range(100000)]'

There are several reasons this has been a pain to track down, including:

  1. It has not been possible to reproduce this locally on Linux. However, it does occur in CI on Linux, though rarely. It is most reproducible on macOS, and second most on Windows.
  2. It occurs randomly.
  3. It seems to occur most readily in low-memory systems such as GitHub action runners, presumably because the GC is more active.
  4. The number of Julia threads does not affect occurrence.
  5. The stack trace is random each time. According to @vchuravy, the `jl_object_id__cold` frame only indicates that an object is not rooted in the GC, so the stack trace might not be helpful.
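Because the crash is random, it can help to measure how often it happens rather than eyeballing single runs. Here is a minimal sketch of a trial counter; the helper names (`died_from_signal`, `count_crashes`) are made up for illustration, and the "negative return code or above 128" rule is a POSIX convention, not something specific to juliacall.

```python
import subprocess
import sys


def died_from_signal(returncode: int) -> bool:
    # POSIX convention: subprocess reports death by signal N as -N,
    # while a shell wrapper reports it as 128 + N (139 for SIGSEGV).
    return returncode < 0 or returncode > 128


def count_crashes(cmd, trials=20, env=None):
    """Run cmd `trials` times and count the runs that died from a signal."""
    crashes = 0
    for _ in range(trials):
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if died_from_signal(proc.returncode):
            crashes += 1
    return crashes


# The reproducer from the post, as an argv list (needs juliacall installed).
REPRODUCER = [
    sys.executable,
    "-c",
    "from juliacall import Main as jl\n"
    "x = [jl.randn(5) for _ in range(100000)]",
]
# Example: count_crashes(REPRODUCER, trials=50)
```

The resulting crash rate is only an estimate, but it gives a number to compare across machines, Python versions, and Julia builds.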

It has been recommended that we build Julia from source in “ASAN mode”. We have not been successful in compiling this on macOS, and there are no pre-built binaries available for it either.

It was also recommended that we build Julia in debug mode and run under rr chaos mode (on Linux). We tried that but it didn’t reproduce the segfault.

When you do hit the error, the segfault will appear in a random form. Here are the ones I have seen in CI:

Example 1 (Julia 1.10; macos-latest; via `ijl_restore_package_image_from_file -> jl_table_assign_bp`)
jl_object_id__cold at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/builtins.c:455
ijl_object_id_ at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/builtins.c:472 [inlined]
jl_table_assign_bp at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/./iddict.c:47
ijl_idtable_rehash at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/./iddict.c:25 [inlined]
jl_table_assign_bp at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/./iddict.c:101
ijl_eqtable_put at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/./iddict.c:146
jl_as_global_root at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/staticdata.c:2361
jl_root_new_gvars at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/staticdata.c:2131 [inlined]
jl_restore_system_image_from_stream_ at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/staticdata.c:3337
jl_restore_package_image_from_stream at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/staticdata.c:3471
jl_restore_incremental_from_buf at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/staticdata.c:3522 [inlined]
ijl_restore_package_image_from_file at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-XC9YQX9HH2.0/build/default-honeycrisp-XC9YQX9HH2-0/julialang/julia-release-1-dot-10/src/staticdata.c:3606
_include_from_serialized at ./loading.jl:1117

#= truncated =#

run_mod at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
_PyRun_SimpleStringFlagsWithName at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
Py_RunMain at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
pymain_main at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
Py_BytesMain at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
Allocations: 1852181 (Pool: 1850203; Big: 1978); GC: 3
/Users/runner/work/_temp/410754f3-e90a-4c10-9411-9fe93f0cbed2.sh: line 3:  1467 Segmentation fault: 11  python -c 'import pysr'

Example 2 (Julia 1.11; macos-latest; via `ijl_compress_ir -> ... -> smallintset_rehash`)
jl_object_id__cold at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/builtins.c:441
smallintset_rehash at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/smallintset.c:218
jl_smallintset_insert at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/smallintset.c:197
jl_idset_put_idx at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/./idset.c:104
jl_as_global_root at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/staticdata.c:2548
jl_encode_as_indexed_root at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/ircode.c:108
jl_encode_value_ at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/ircode.c:444
jl_encode_memory_slice at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/ircode.c:135
jl_encode_value_ at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/ircode.c:403
ijl_compress_ir at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/ircode.c:866
maybe_compress_codeinfo at ./compiler/typeinfer.jl:394

#= truncated =#

PyEval_EvalCode at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
run_eval_code_obj at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
run_mod at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
_PyRun_SimpleStringFlagsWithName at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
Py_RunMain at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
pymain_main at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
Py_BytesMain at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
Allocations: 2882135 (Pool: 2881962; Big: 173); GC: 4
/Users/runner/work/_temp/e37b27f9-612d-4bd8-864c-753733e545f1.sh: line 3:  1496 Segmentation fault: 11  python -c 'import pysr'
Error: Process completed with exit code 139.

Example 3 (Julia 1.11; macos-latest; via `emit_expr -> ... -> smallintset_rehash`)
jl_object_id__cold at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/builtins.c:441
smallintset_rehash at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/smallintset.c:218
jl_smallintset_insert at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/smallintset.c:197
jl_idset_put_idx at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/./idset.c:104
jl_as_global_root at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/staticdata.c:2548
emit_expr at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/codegen.cpp:6151
emit_intrinsic at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/./intrinsics.cpp:1271 [inlined]
emit_call at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/codegen.cpp:5203
emit_expr at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/codegen.cpp:6201
emit_ssaval_assign at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-R17H3W25T9.0/build/default-honeycrisp-R17H3W25T9-0/julialang/julia-release-1-dot-11/src/codegen.cpp:5747

#= truncated =#

pymain_run_module at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
Py_RunMain at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
pymain_main at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
Py_BytesMain at /Library/Frameworks/Python.framework/Versions/3.13/Python (unknown line)
Allocations: 2950021 (Pool: 2949848; Big: 173); GC: 4
/Users/runner/work/_temp/0f2d754b-1c4e-411c-a6ee-443612959c5a.sh: line 1:  9318 Segmentation fault: 11  python -m pysr test main,cli,startup
Error: Process completed with exit code 139.

I have also posted this to CPython as an issue here: Random segfaults on Python 3.12.10 during CI testing · Issue #134193 · python/cpython · GitHub. They seemed confident that the issue is not from Python.

Since you mention that low-memory systems seem relevant, have you tried limiting the amount of memory that’s exposed to the Julia process through a cgroup? See e.g.

cgroups are Linux-only, unfortunately.

Yes, but since you mentioned that it rarely shows up in Linux CI, my thought was that this may help reproduce it on Linux too.
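One way to try the cgroup idea without writing to the cgroup filesystem by hand is to wrap the command in a transient systemd scope. This is a rough sketch, assuming a systemd-based distro with cgroup v2 and the memory controller delegated to user sessions; the helper name `cgroup_limited` and the default limit are hypothetical. `MemoryMax=` and `MemorySwapMax=` are systemd resource-control properties.

```python
import sys


def cgroup_limited(cmd, memory_max="2G"):
    """Prefix cmd so it runs in a transient systemd scope with a memory cap."""
    return [
        "systemd-run",
        "--user",    # use the per-user service manager (no root needed)
        "--scope",   # run the command directly inside a transient scope unit
        "-p", f"MemoryMax={memory_max}",  # hard memory limit for the scope
        "-p", "MemorySwapMax=0",          # keep the process from escaping into swap
        *cmd,
    ]


# Wrap the reproducer from the original post (needs juliacall installed).
wrapped = cgroup_limited(
    [sys.executable, "-c",
     "from juliacall import Main as jl; x = [jl.randn(5) for _ in range(100000)]"],
    memory_max="1G",
)
print(" ".join(wrapped[:7]))
```

The returned list can be passed to `subprocess.run`. Lowering `memory_max` step by step should eventually end in an ordinary OOM kill, so the interesting range is just above that point.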

I see, will try

In general, reproducing this locally on Linux should be your goal if possible, since it can then be run under rr.

We just tried. The segfault failed to reproduce. We lowered memory until it triggered a regular OOM “Killed” message from Python. No segfaults were observed in between.

What can we do if we can’t? It’s been several months and we still don’t have a reliable reproducer on Linux. We also couldn’t get this to trigger in rr chaos mode, nor in a cgroup-limited low-memory mode.

Are you able to bisect the regression to a specific Julia commit?

I doubt we would be able to; reproducing it is too inconsistent to be confident whether or not a “success” is actually real.
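To put a number on that worry: if a bad commit crashes with probability p per run, and runs are independent, then n clean runs in a row happen by chance with probability (1 - p)^n. A back-of-the-envelope sketch follows; the function name and the 5% threshold are arbitrary choices, and the independence assumption may well not hold on shared CI runners.

```python
import math


def clean_runs_needed(p_crash, alpha=0.05):
    """Smallest n such that (1 - p_crash)**n <= alpha.

    That is, the number of consecutive clean runs needed before the chance
    of a bad commit passing by luck drops to alpha or below.
    """
    if not 0.0 < p_crash < 1.0:
        raise ValueError("p_crash must be strictly between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log1p(-p_crash))


print(clean_runs_needed(0.10))  # 29
print(clean_runs_needed(0.02))  # 149
```

So with a 2% crash rate, each bisection step would need on the order of 150 clean runs before a "good" verdict means much, which is why bisecting looks impractical without a more reliable reproducer.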

Though maybe @darnstrom could try under rr? Apparently cvxpygen users are starting to hit this as well, but on Linux: Segmentation fault related to pdaqp · Issue #88 · cvxgrp/cvxpygen · GitHub

So maybe he is able to actually get an rr recording to debug this

If you’re able to set up a bug bounty somewhere, I’d be willing to pitch a few $ towards it. I don’t really have the expertise to help besides that, though.

There’s a decent chance you know this already, or that it is irrelevant, but this sounds very similar to an issue I am having with PyCall.jl.

I haven’t really been able to fully pin this down either, but I suppose it is somehow related to the GC interfering with PyCall when multithreading.
For me, the workaround mentioned in this issue seems to work (disable the GC while calling Python code).

Do you think your issue might be related? If so, I could try to find an MWE (if I can manage to reproduce it consistently).
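For reference, the same "pause the GC around the risky section" idea could be tried from the juliacall side roughly like this. It is only a sketch: the context manager name is hypothetical, it relies on Julia's `GC.enable` returning the previous state, and nothing in this thread establishes that it helps with the PythonCall crash.

```python
from contextlib import contextmanager


@contextmanager
def julia_gc_paused(jl):
    """Disable Julia's GC inside the with-block, then restore the prior state.

    `jl` is expected to be juliacall's `Main` (or anything exposing GC.enable).
    """
    # GC.enable(false) turns collection off and returns whether it was on before.
    was_enabled = bool(jl.GC.enable(False))
    try:
        yield
    finally:
        jl.GC.enable(was_enabled)


# Intended use (needs juliacall installed):
#   from juliacall import Main as jl
#   with julia_gc_paused(jl):
#       x = [jl.randn(5) for _ in range(100000)]
```

Memory use grows unchecked while the GC is off, so this is a diagnostic aid rather than a fix.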

I could try it. However, I have no experience with using rr. Is there a quick way to use it if I have a Python script that (sometimes) leads to the segfault?

One of the things that has frustrated me about this issue is that I don’t know how to reproduce it with a custom Julia build.

The reproducers so far are all Python driven and it is very unclear to me how to actually set it up in a reliable fashion.

After much cursing at Python…

On Linux:

vchuravy@loki ~/s/j/a/PythonCall.jl (main) [1]> LBT_USE_RTLD_DEEPBIND=0 LD_PRELOAD=/home/vchuravy/src/julia-1.11-asan/toolchain/usr/lib/clang/16/lib/linux/libclang_rt.asan-x86_64.so PYTHON_JULIACALL_HANDLE_SIGNALS=yes uv run python -c "from juliacall import Main as jl; [jl.randn(5) for _ in range(10000)]"

[121820] signal 11 (1): Segmentation fault
in expression starting at none:0
gc_assert_parent_validity at /home/vchuravy/src/julia-1.11/src/gc.c:1974
gc_mark_objarray at /home/vchuravy/src/julia-1.11/src/gc.c:2245
gc_mark_outrefs at /home/vchuravy/src/julia-1.11/src/gc.c:2829 [inlined]
gc_mark_loop_serial_ at /home/vchuravy/src/julia-1.11/src/gc.c:2938
gc_mark_loop_serial at /home/vchuravy/src/julia-1.11/src/gc.c:2961
_jl_gc_collect at /home/vchuravy/src/julia-1.11/src/gc.c:3532
ijl_gc_collect at /home/vchuravy/src/julia-1.11/src/gc.c:3893
maybe_collect at /home/vchuravy/src/julia-1.11/src/gc.c:926 [inlined]
jl_gc_big_alloc_inner at /home/vchuravy/src/julia-1.11/src/gc.c:1013
ijl_gc_big_alloc at /home/vchuravy/src/julia-1.11/src/gc.c:1045 [inlined]
jl_gc_pool_alloc_inner at /home/vchuravy/src/julia-1.11/src/gc.c:1317 [inlined]
ijl_gc_pool_alloc_instrumented at /home/vchuravy/src/julia-1.11/src/gc.c:1377
Array at ./boot.jl:579 [inlined]
InstructionStream at ./compiler/ssair/ir.jl:219
IncrementalCompact at ./compiler/ssair/ir.jl:695
IncrementalCompact at ./compiler/ssair/ir.jl:727 [inlined]
IncrementalCompact at ./compiler/ssair/ir.jl:727 [inlined]
sroa_pass! at ./compiler/ssair/passes.jl:1188
run_passes_ipo_safe at ./compiler/optimize.jl:994
run_passes_ipo_safe at ./compiler/optimize.jl:1009 [inlined]
optimize at ./compiler/optimize.jl:983
jfptr_optimize_36954 at /home/vchuravy/src/julia-1.11-asan/asan/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/vchuravy/src/julia-1.11/src/gf.c:2929
ijl_apply_generic at /home/vchuravy/src/julia-1.11/src/gf.c:3125
finish_nocycle at ./compiler/typeinfer.jl:265
_typeinf at ./compiler/typeinfer.jl:249
typeinf at ./compiler/typeinfer.jl:215
typeinf_edge at ./compiler/typeinfer.jl:923
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2423
abstract_eval_call at ./compiler/abstractinterpretation.jl:2438
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2454
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2752
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3044
typeinf_local at ./compiler/abstractinterpretation.jl:3331
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3413
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_edge at ./compiler/typeinfer.jl:923
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2423
abstract_eval_call at ./compiler/abstractinterpretation.jl:2438
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2454
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2752
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3068
typeinf_local at ./compiler/abstractinterpretation.jl:3331
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3413
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_edge at ./compiler/typeinfer.jl:923
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2423
abstract_eval_call at ./compiler/abstractinterpretation.jl:2438
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2454
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2752
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3068
typeinf_local at ./compiler/abstractinterpretation.jl:3331
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3413
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_edge at ./compiler/typeinfer.jl:923
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2423
abstract_eval_call at ./compiler/abstractinterpretation.jl:2438
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2454
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2752
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3068
typeinf_local at ./compiler/abstractinterpretation.jl:3331
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3413
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_ext at ./compiler/typeinfer.jl:1101
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1139
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1135
jfptr_typeinf_ext_toplevel_34145 at /home/vchuravy/src/julia-1.11-asan/asan/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/vchuravy/src/julia-1.11/src/gf.c:2929
ijl_apply_generic at /home/vchuravy/src/julia-1.11/src/gf.c:3125
jl_apply at /home/vchuravy/src/julia-1.11/src/julia.h:2157 [inlined]
jl_type_infer at /home/vchuravy/src/julia-1.11/src/gf.c:390
jl_generate_fptr_impl at /home/vchuravy/src/julia-1.11/src/jitlayers.cpp:519
jl_compile_method_internal at /home/vchuravy/src/julia-1.11/src/gf.c:2536
_jl_invoke at /home/vchuravy/src/julia-1.11/src/gf.c:2940
ijl_apply_generic at /home/vchuravy/src/julia-1.11/src/gf.c:3125
GeneratedFunctionStub at ./boot.jl:707
_jl_invoke at /home/vchuravy/src/julia-1.11/src/gf.c:2929
ijl_apply_generic at /home/vchuravy/src/julia-1.11/src/gf.c:3125
jl_call_staged at /home/vchuravy/src/julia-1.11/src/method.c:601
ijl_code_for_staged at /home/vchuravy/src/julia-1.11/src/method.c:656
get_staged at ./compiler/utilities.jl:123
retrieve_code_info at ./compiler/utilities.jl:135 [inlined]
InferenceState at ./compiler/inferencestate.jl:497
typeinf_edge at ./compiler/typeinfer.jl:913
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2423
abstract_eval_call at ./compiler/abstractinterpretation.jl:2438
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2454
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2752
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3044
typeinf_local at ./compiler/abstractinterpretation.jl:3331
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3413
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_edge at ./compiler/typeinfer.jl:923
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2423
abstract_eval_call at ./compiler/abstractinterpretation.jl:2438
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2454
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2752
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3044
typeinf_local at ./compiler/abstractinterpretation.jl:3331
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3413
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_edge at ./compiler/typeinfer.jl:923
abstract_call_method at ./compiler/abstractinterpretation.jl:660
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:102
abstract_call_known at ./compiler/abstractinterpretation.jl:2200
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2423
abstract_eval_call at ./compiler/abstractinterpretation.jl:2438
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2454
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2752
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3068
typeinf_local at ./compiler/abstractinterpretation.jl:3331
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3413
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_ext at ./compiler/typeinfer.jl:1101
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1139
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1135
jfptr_typeinf_ext_toplevel_34145 at /home/vchuravy/src/julia-1.11-asan/asan/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/vchuravy/src/julia-1.11/src/gf.c:2929
ijl_apply_generic at /home/vchuravy/src/julia-1.11/src/gf.c:3125
jl_apply at /home/vchuravy/src/julia-1.11/src/julia.h:2157 [inlined]
jl_type_infer at /home/vchuravy/src/julia-1.11/src/gf.c:390
jl_generate_fptr_impl at /home/vchuravy/src/julia-1.11/src/jitlayers.cpp:519
jl_compile_method_internal at /home/vchuravy/src/julia-1.11/src/gf.c:2536
_jl_invoke at /home/vchuravy/src/julia-1.11/src/gf.c:2940
ijl_apply_generic at /home/vchuravy/src/julia-1.11/src/gf.c:3125
_pyjl_callmethod at /home/vchuravy/src/julia-1.11-asan/asan/PythonCall.jl/src/JlWrap/base.jl:67
_pyjl_callmethod at /home/vchuravy/src/julia-1.11-asan/asan/PythonCall.jl/src/JlWrap/C.jl:63
jfptr__pyjl_callmethod_11015 at /home/vchuravy/.julia/compiled/v1.11/PythonCall/WdXsa_t9Ws0.so (unknown line)
_jl_invoke at /home/vchuravy/src/julia-1.11/src/gf.c:2929
ijl_apply_generic at /home/vchuravy/src/julia-1.11/src/gf.c:3125
jlcapi__pyjl_callmethod_11112 at /home/vchuravy/.julia/compiled/v1.11/PythonCall/WdXsa_t9Ws0.so (unknown line)
unknown function (ip: 0x7f4517d94c02)
_PyObject_MakeTpCall at /usr/lib/libpython3.13.so.1.0 (unknown line)
_PyEval_EvalFrameDefault at /usr/lib/libpython3.13.so.1.0 (unknown line)
PyObject_Vectorcall at /usr/lib/libpython3.13.so.1.0 (unknown line)
unknown function (ip: 0x7f4517e9bd16)
unknown function (ip: 0x7f4517e0ab83)
PyObject_GetAttr at /usr/lib/libpython3.13.so.1.0 (unknown line)
_PyObject_GetMethod at /usr/lib/libpython3.13.so.1.0 (unknown line)
_PyEval_EvalFrameDefault at /usr/lib/libpython3.13.so.1.0 (unknown line)
PyEval_EvalCode at /usr/lib/libpython3.13.so.1.0 (unknown line)
unknown function (ip: 0x7f4517e8af5b)
unknown function (ip: 0x7f4517e8801a)
unknown function (ip: 0x7f4517e8335d)
unknown function (ip: 0x7f4517e831c2)
Py_RunMain at /usr/lib/libpython3.13.so.1.0 (unknown line)
Py_BytesMain at /usr/lib/libpython3.13.so.1.0 (unknown line)
unknown function (ip: 0x7f4517a376b4)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
_start at /home/vchuravy/src/julia-1.11-asan/asan/PythonCall.jl/.venv/bin/python3 (unknown line)
Allocations: 2007168 (Pool: 0; Other: 2007168); GC: 10
Allocations: 2007168 (Pool: 0; Other: 2007168); GC: 10

The call to gc_assert_parent_validity at /home/vchuravy/src/julia-1.11/src/gc.c:1974 is very interesting.

Looking at the code tells us that GC_VERIFY or GC_ASSERT_PARENT_VALIDITY might be enough as build options to reproduce this on Linux.

ASAN is built with

# make the GC use regular malloc/frees, which are hooked by ASAN
override WITH_GC_DEBUG_ENV=1

# default to a debug build for better line number reporting
override JULIA_BUILD_MODE=debug

# Enable Julia assertions and LLVM assertions
FORCE_ASSERTIONS=1
LLVM_ASSERTIONS=1

The magic command was:

 export PYTHON_JULIAPKG_EXE=$(pwd)/julia
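For anyone driving this from a script rather than a shell, the same setup can be sketched as an environment dict passed to a subprocess. The two variable names are the ones used in this thread; the helper name and the example path are made up.

```python
import os
import sys


def custom_julia_env(julia_exe, base_env=None):
    """Return an environment that points juliacall at a specific Julia binary."""
    env = dict(os.environ if base_env is None else base_env)
    # Tell JuliaPkg which Julia executable to use instead of a managed one.
    env["PYTHON_JULIAPKG_EXE"] = os.path.abspath(julia_exe)
    # Let Julia install its signal handlers so a segfault prints a backtrace.
    env["PYTHON_JULIACALL_HANDLE_SIGNALS"] = "yes"
    return env


# Intended use (path is hypothetical; needs juliacall installed):
#   import subprocess
#   env = custom_julia_env("./julia")
#   code = "from juliacall import Main as jl; [jl.randn(5) for _ in range(10000)]"
#   subprocess.run([sys.executable, "-c", code], env=env)
```

The variables have to be set before `juliacall` is imported, which is why the sketch launches a fresh interpreter instead of mutating `os.environ` after the fact.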

Great that you can now reproduce!

I’m assuming this is 1.11.6 without edits, so, this line? julia/src/gc.c at 9615af0f269df4d371b8010e9507ed5bae86103b · JuliaLang/julia · GitHub

    if (__unlikely(!jl_is_datatype((jl_datatype_t *)child_vt) || ((jl_datatype_t *)child_vt)->smalltag)) {

So, it looks like it’s segfaulting when accessing a datatype’s fields. Is that right? Or is that segfault just specific to the GC verification code itself?

The whole thing is utterly bizarre:

vchuravy@loki ~/s/j/a/PythonCall.jl (main)> rr ps
PID	PPID	EXIT	CMD
152071	--	139	uv run python -c from juliacall import Main as jl; [jl.randn(5) for _ in range(10000)]
152075	152071	-11	/home/vchuravy/src/julia-1.11-asan/asan/PythonCall.jl/.venv/bin/python3 -c from juliacall import Main as jl; [jl.randn(5) for _ in range(10000)]
152076	152075	0	/home/vchuravy/src/julia-1.11/julia --version
152077	152076	0	(forked without exec)
152078	152075	0	/home/vchuravy/src/julia-1.11/julia --project=/home/vchuravy/src/julia-1.11-asan/asan/PythonCall.jl/.venv/julia_env --startup-file=no -O0 --compile=min -e import Libdl; print(abspath(Libdl.dlpath("libjulia-debug")), "\0", Sys.BINDIR)
152079	152078	0	(forked without exec)
152083	152075	0	(forked without exec)

vchuravy@loki ~/s/j/a/PythonCall.jl (main)> rr replay -p 152075 -e
...
0x0000287f103f1cc8 in gc_assert_parent_validity (parent=0x5cd61d40f6d0,
    child=0x7eb842e9af00 <jl_system_image_data+8141824>)
    at /home/vchuravy/src/julia-1.11/src/gc.c:1974
1974	    if (__unlikely(!jl_is_datatype((jl_datatype_t *)child_vt) || ((jl_datatype_t *)child_vt)->smalltag)) {
(rr) bt 5
#0  0x0000287f103f1cc8 in gc_assert_parent_validity (parent=0x5cd61d40f6d0,
    child=0x7eb842e9af00 <jl_system_image_data+8141824>)
    at /home/vchuravy/src/julia-1.11/src/gc.c:1974
#1  0x0000287f103f37b2 in gc_mark_objarray (ptls=0x56375cb1ca40,
    obj_parent=0x5cd61d40f6d0, obj_begin=0x56375d3dddb0,
    obj_end=0x56375d3e0460, step=1, nptr=32371)
    at /home/vchuravy/src/julia-1.11/src/gc.c:2245
#2  0x0000287f103f9e2a in gc_mark_outrefs (ptls=0x56375cb1ca40,
    mq=0x56375cb1d8c0, _new_obj=0x5cd61d40f6d0, meta_updated=1)
    at /home/vchuravy/src/julia-1.11/src/gc.c:2829
#3  gc_queue_remset (ptls=0x56375cb1ca40, ptls2=0x56375cb1ca40)
    at /home/vchuravy/src/julia-1.11/src/gc.c:3252
#4  0x0000287f103fb444 in _jl_gc_collect (ptls=0x56375cb1ca40,
    collection=JL_GC_AUTO) at /home/vchuravy/src/julia-1.11/src/gc.c:3522
(More stack frames follow...)

So what is that object parent we are looking at?

(rr) p jl_(parent)
Memory{Any}(8092, 0x56375d3d0780)[
  "cannot construct a value of type Union{} for return result",
  "invalid Array dimensions",
  "with-static-parameters",
  "cannot convert a value to Union{} for assignment",
  "Cannot call tail on an empty tuple.",
  "typename does not apply to this type",
  "typename does not apply to unions whose components have different typenames",
  Tuple{Vararg{T, N}} where T where N,
  "Union{} does not have elements",
# ...
  (:consistent, :notaskstate, :nothrow),
  Tuple{typeof(Core.Compiler.convert), Type{T}, Core.Compiler.CallInfo} where T,
  Core.Compiler.ComposedFunction{typeof(Core.Compiler.:(The program being debugged received signal SIGSEGV, Segmentation fault

This looks like constant data…

We also see:

#0  0x0000287f103f1cc8 in gc_assert_parent_validity (parent=0x5cd61d40f6d0,
    child=0x7eb842e9af00 <jl_system_image_data+8141824>)

So the “corrupted” child is from some package image.

(rr) info sym 0x7eb842e9af00
jl_system_image_data + 8141824 in section .ldata of /home/vchuravy/.julia/compiled/v1.11/PythonCall/WdXsa_FQMdA.so

Uhm, great, we have either corrupted some constant data or stored some illegal data in a package image.

So we have some more breadcrumbs to follow.

  1. Let’s watch for any memory write to the object
0x00006e36383e2cc8 in gc_assert_parent_validity (parent=0x43fd6a5f76d0,
    child=0x588277d1b3c0 <jl_system_image_data+8141504>)
    at /home/vchuravy/src/julia-1.11/src/gc.c:1974
1974	    if (__unlikely(!jl_is_datatype((jl_datatype_t *)child_vt) || ((jl_datatype_t *)child_vt)->smalltag)) {
(rr) watch *(uintptr_t*)0x588277d1b3c0
Hardware watchpoint 2: *(uintptr_t*)0x588277d1b3c0
  2. Every Julia object has a tag before it. If that tag is corrupted, we don’t know what is stored there.
(rr) p/x child_astagged
$4 = 0x588277d1b3b8
(rr) watch *(uintptr_t*)0x588277d1b3b8
Hardware watchpoint 3: *(uintptr_t*)0x588277d1b3b8
  3. The object is currently stored in an array, so we have the “slot” address and can watch when this object got written to this array.
(rr) up
#1  0x00006e36383e47b2 in gc_mark_objarray (ptls=0x5607db3e3fc0, obj_parent=0x43fd6a5f76d0, obj_begin=0x5607dbca7498, obj_end=0x5607dbca98a0, step=1,
    nptr=32370) at /home/vchuravy/src/julia-1.11/src/gc.c:2245
(rr) p slot
$5 = (jl_value_t **) 0x5607dbca7498
(rr) watch *(jl_value_t **) 0x5607dbca7498
Hardware watchpoint 4: *(jl_value_t **) 0x5607dbca7498
(rr) rc
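
For anyone following along, the addresses in the three watchpoints are related by simple arithmetic. Here is a minimal sketch in Python that only re-derives numbers already printed in the session. The one-word (8-byte) tag header is an assumption about a 64-bit build, and the variable names are mine, not Julia's.

```python
# Assumption: 64-bit build, so the tag is one 8-byte word placed
# directly in front of the object it describes.
WORD = 8

# Values copied from the rr session above.
child = 0x588277d1b3c0    # object address from gc_assert_parent_validity
image_offset = 8141504    # from <jl_system_image_data+8141504>

# Address of the tag word that belongs to `child`.
tag_addr = child - WORD

# Start of the image data that gdb resolved the symbol against.
image_base = child - image_offset

print(hex(tag_addr))
print(hex(image_base))
```

The first value agrees with what `p/x child_astagged` printed, which is why watchpoint 3 sits 8 bytes below watchpoint 2.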

Okay, let’s see if we hit any useful breakpoints.

Continuing.
Downloading 343.98 K source file /usr/src/debug/python/Python-3.13.5/Objects/typeobject.c

Thread 2 hit Hardware watchpoint 3: *(uintptr_t*)0x588277d1b3b8

Old value = 65703
New value = 167
set_version_unlocked (tp=0x588277d1b220 <jl_system_image_data+8141088>, version=238) at Objects/typeobject.c:972
972	        tp->tp_versions_used++;

Uhm, why is Python smashing our type-tag?
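
The numbers in that watchpoint hit can be decoded a little further. Below is a small sketch that only does arithmetic on values printed above. It assumes that, because rr was running backwards, "Old value" is the later state and "New value" the earlier one, and that the machine is little-endian; the variable names are hypothetical.

```python
# Values copied from the rr session above.
tp = 0x588277d1b220        # the PyTypeObject* passed to set_version_unlocked
tag_addr = 0x588277d1b3b8  # the Julia tag word being watched

# Assumption: under reverse execution the printed "Old value" is the
# state after the write and "New value" is the state before it.
after_write, before_write = 65703, 167

# How far into the Python type object the Julia tag word sits.
offset_in_type = tag_addr - tp

# What the write changed, expressed as a delta on the 8-byte word.
delta = after_write - before_write

# A +1 on a field starting at byte k of a little-endian word shows up
# as a delta of 256**k on the whole word.
byte_in_word = (delta.bit_length() - 1) // 8

print(offset_in_type, hex(delta), byte_in_word)
```

So the write looks like a `++` on a small field two bytes into the word that starts 408 bytes after `tp`, which happens to be the same word Julia uses as the tag of the object that follows.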

(rr) bt
#0  set_version_unlocked (tp=0x588277d1b220 <jl_system_image_data+8141088>,
    version=238) at Objects/typeobject.c:972
#1  assign_version_tag (interp=interp@entry=0x506a4085a5b0 <_PyRuntime+88400>,
    type=<optimized out>) at Objects/typeobject.c:1164
#2  0x0000506a404b0ce3 in assign_version_tag (
    interp=interp@entry=0x506a4085a5b0 <_PyRuntime+88400>,
    type=<optimized out>) at Objects/typeobject.c:1154
#3  0x0000506a404b0ce3 in assign_version_tag (
    interp=interp@entry=0x506a4085a5b0 <_PyRuntime+88400>,
    type=type@entry=0x5607dba92b90) at Objects/typeobject.c:1154
#4  0x0000506a4056395a in _PyType_LookupRef (type=0x5607dba92b90,
    name=0x506a4084e510 <_PyRuntime+39088>) at Objects/typeobject.c:5293
#5  _Py_slot_tp_getattr_hook (self=0x548276983940, name=0x321c2d527a20)
    at Objects/typeobject.c:9629
#6  0x0000506a404bc3c2 in PyObject_GetAttr (v=0x548276983940,
    name=0x321c2d527a20) at Objects/object.c:1261
#7  0x0000506a404f377f in _PyObject_GetMethod (obj=0x548276983940,
    name=0x321c2d527a20, method=0x7ffd67ad4fb8) at Objects/object.c:1552
#8  0x0000506a404d3218 in _PyEval_EvalFrameDefault (tstate=<optimized out>,
    frame=<optimized out>, throwflag=<optimized out>)
    at Python/generated_cases.c.h:3748
#9  0x0000506a405a58b9 in PyEval_EvalCode (co=0x30ac6239a170,
    globals=<optimized out>, locals=0x321c2d4f0800) at Python/ceval.c:604
#10 0x0000506a405e3f5c in run_eval_code_obj (
    tstate=tstate@entry=0x506a40889df0 <_PyRuntime+283024>,
    co=co@entry=0x30ac6239a170, globals=globals@entry=0x321c2d4f0800,
    locals=locals@entry=0x321c2d4f0800) at Python/pythonrun.c:1381
#11 0x0000506a405e101b in run_mod (mod=mod@entry=0x5607daf3fe88,
    filename=filename@entry=0x321c2d4f08f0,
    globals=globals@entry=0x321c2d4f0800, locals=locals@entry=0x321c2d4f0800,
    flags=flags@entry=0x7ffd67ad52f8, arena=arena@entry=0x30ac6230fd30,
    interactive_src=0x321c2d511ae0, generate_new_source=0)
    at Python/pythonrun.c:1466
#12 0x0000506a405dc35e in _PyRun_StringFlagsWithName (
    str=str@entry=0x321c2d511400 "from juliacall import Main as jl; [jl.randn(5) for _ in range(10000)]\n", name=name@entry=0x321c2d4f08f0,
    start=start@entry=257, globals=globals@entry=0x321c2d4f0800,
    locals=locals@entry=0x321c2d4f0800, flags=flags@entry=0x7ffd67ad52f8,
    generate_new_source=0) at Python/pythonrun.c:1261

We can confirm that prior to that we had a valid Julia object.

(rr) p jl_(0x588277d1b3c0)
"juliacall.ValueBase"
$6 = void

So yay? We found our culprit? For some reason Python is smashing the type-tag of a Julia object…

Out of curiosity I kept reverse executing.

Thread 2 hit Hardware watchpoint 4: *(jl_value_t **) 0x5607dbca7498

Old value = (jl_value_t *) 0x588277d1b3c0 <jl_system_image_data+8141504>
New value = (jl_value_t *) 0x0
0x00006e3638378e6f in jl_genericmemory_ptr_set (m=0x43fd6a5f76d0, i=6939,
    x=0x588277d1b3c0 <jl_system_image_data+8141504>)
    at /home/vchuravy/src/julia-1.11/src/julia.h:1201
1201	    jl_atomic_store_release(((_Atomic(jl_value_t*)*)(m_->ptr)) + i, (jl_value_t*)x);
0x00006e3638378e6f in jl_genericmemory_ptr_set (m=0x43fd6a5f76d0, i=6939,
    x=0x588277d1b3c0 <jl_system_image_data+8141504>)
    at /home/vchuravy/src/julia-1.11/src/julia.h:1201
1201	    jl_atomic_store_release(((_Atomic(jl_value_t*)*)(m_->ptr)) + i, (jl_value_t*)x);
(rr) bt
#0  0x00006e3638378e6f in jl_genericmemory_ptr_set (m=0x43fd6a5f76d0, i=6939,
    x=0x588277d1b3c0 <jl_system_image_data+8141504>)
    at /home/vchuravy/src/julia-1.11/src/julia.h:1201
#1  0x00006e363837c1fd in jl_idset_put_key (keys=0x43fd6a5f76d0,
    key=0x588277d1b3c0 <jl_system_image_data+8141504>, newidx=0x7ffd67acc4e0)
    at /home/vchuravy/src/julia-1.11/src/idset.c:89
#2  0x00006e36383b029b in jl_as_global_root (
    val=0x588277d1b3c0 <jl_system_image_data+8141504>, insert=1)
    at /home/vchuravy/src/julia-1.11/src/staticdata.c:2547
#3  0x00006e36383af2c9 in jl_root_new_gvars (s=0x7ffd67accb70,
    image=0x7ffd67ace040, external_fns_begin=2748)
    at /home/vchuravy/src/julia-1.11/src/staticdata.c:2283
#4  0x00006e36383b4ee1 in jl_restore_system_image_from_stream_ (
    f=0x7ffd67acdf10, image=0x7ffd67ace040, depmods=0x43fd6a5142f0,
    checksum=18085326892769599201, restored=0x7ffd67acdc50,
    init_order=0x7ffd67acdc58, extext_methods=0x7ffd67acdc60,
    internal_methods=0x7ffd67acdc68, new_ext_cis=0x7ffd67acdc70,
    method_roots_list=0x7ffd67acdc78, ext_targets=0x7ffd67acdc80,
    edges=0x7ffd67acdc88, base=0x7ffd67acdc48, ccallable_list=0x7ffd67acdd90,
    cachesizes=0x7ffd67acdcf0)
    at /home/vchuravy/src/julia-1.11/src/staticdata.c:3532
#5  0x00006e36383b5716 in jl_restore_package_image_from_stream (
    pkgimage_handle=0x5607db723940, f=0x7ffd67acdf10, image=0x7ffd67ace040,
    depmods=0x43fd6a5142f0, completeinfo=0,
    pkgname=0x43fd697287f8 "PythonCall", needs_permalloc=0)
    at /home/vchuravy/src/julia-1.11/src/staticdata.c:3670
#6  0x00006e36383b5bd8 in jl_restore_incremental_from_buf (
    pkgimage_handle=0x5607db723940,
    buf=0x588277557900 <jl_system_image_data> "\373jli\r\n\032\n\f",
    image=0x7ffd67ace040, sz=11933428, depmods=0x43fd6a5142f0, completeinfo=0,
    pkgname=0x43fd697287f8 "PythonCall", needs_permalloc=0)
    at /home/vchuravy/src/julia-1.11/src/staticdata.c:3729
#7  0x00006e36383b6267 in ijl_restore_package_image_from_file (
    fname=0x43fd6aaba138 "/home/vchuravy/.julia/compiled/v1.11/PythonCall/WdXsa_9Ah4Y.so", depmods=0x43fd6a5142f0, completeinfo=0,
    pkgname=0x43fd697287f8 "PythonCall", ignore_native=0)

So it is a global variable that we are restoring…
perhaps

I have no clue what this code is trying to do, but in any case Python should not stamp over Julia’s type-tag.


Unless someone is miscalculating how big a PyTypeObject is… it seems somewhat suspicious that the tp_versions_used field is the last one, no? Or maybe even more suspicious: it’s a new field added in a recent CPython version?
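
To make that suspicion concrete, here is a toy `ctypes` model of the failure mode. It is not the real `PyTypeObject` layout: `ShortType`, `FullType`, `Arena`, and the field names are all hypothetical. It shows only the mechanism, where one side reserves space for a shorter struct and the other side writes a trailing field that the shorter definition lacks.

```python
import ctypes

class ShortType(ctypes.Structure):
    # Hypothetical: the layout the embedding side reserves space for.
    _fields_ = [("slots", ctypes.c_uint64 * 4)]

class FullType(ctypes.Structure):
    # Hypothetical: the layout the runtime uses, with extra trailing fields.
    _fields_ = [("slots", ctypes.c_uint64 * 4),
                ("watched", ctypes.c_ubyte),
                ("versions_used", ctypes.c_uint16)]

class Arena(ctypes.Structure):
    # The short struct, immediately followed by a neighbour's tag word.
    _fields_ = [("type_obj", ShortType),
                ("next_tag", ctypes.c_uint64)]

arena = Arena()
arena.next_tag = 167  # a valid-looking tag for the neighbouring object

# The runtime views the same memory through the longer definition.
# sizeof(FullType) <= sizeof(Arena), so this stays inside the buffer.
full_view = ctypes.cast(ctypes.pointer(arena),
                        ctypes.POINTER(FullType)).contents
full_view.versions_used += 1  # lands inside next_tag, not inside type_obj

print(arena.next_tag)
```

On a little-endian machine the neighbour's tag goes from 167 to 65703, the same pair of values the watchpoint reported. That does not prove this is what happens in PythonCall, but it shows that a too-short struct definition would be enough to produce this exact corruption.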
