Added in GH-113462: Limit the number of versions that a single class can use. by markshannon · Pull Request #114900 · python/cpython · GitHub, which first appeared in v3.13.0a4 (and apparently it’s undocumented to date)
Ah, `PyTypeObject` has a mirror definition in PythonCall.jl/src/C/consts.jl at 815c0de27a5a175411447dea250e82d52e40f7c5 · JuliaPy/PythonCall.jl · GitHub, and then raw pointers to that struct are handed around.
Oooh, I have a hypothesis: I think the C struct definitions in PythonCall are out of date. Specifically, I think they might be too short.
`tp->tp_versions_used++` is Python modifying the fields of a Python type, probably while initialising one of the Python types defined in PythonCall. If the memory allocated for the type is too short (because the corresponding struct in PythonCall is wrong), then this increment could be modifying something else entirely, such as the type tag of a Julia object.
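The failure mode can be simulated in pure Python with `ctypes`. This is an illustrative sketch, not the real `PyTypeObject` layout: the struct and field names are stand-ins, but the mechanism (an allocation sized by an out-of-date mirror, written through the true layout) is the same.

```python
import ctypes

# An out-of-date mirror of some C struct: it is missing the newest field,
# so any allocation sized from it is 2 bytes too short.
class ShortMirror(ctypes.Structure):
    _fields_ = [("tp_name", ctypes.c_char_p)]

# The real layout, with the field the mirror is missing.
class FullStruct(ctypes.Structure):
    _fields_ = [("tp_name", ctypes.c_char_p),
                ("tp_versions_used", ctypes.c_uint16)]

# Allocate using the (too small) mirror's size, plus 2 sentinel bytes that
# stand in for whatever object happens to live next in memory.
buf = ctypes.create_string_buffer(ctypes.sizeof(ShortMirror) + 2)
assert buf.raw[ctypes.sizeof(ShortMirror):] == b"\x00\x00"  # sentinel untouched

# The interpreter, which knows the full layout, increments the final field...
full = ctypes.cast(buf, ctypes.POINTER(FullStruct)).contents
full.tp_versions_used += 1

# ...and the bytes beyond the mirror's size have been silently modified.
assert buf.raw[ctypes.sizeof(ShortMirror):] != b"\x00\x00"
```

In the real bug the "sentinel" is whatever Julia placed after the undersized type allocation, which is why the symptom showed up as corruption of unrelated Julia objects.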
If I’m right, we just need to update this struct PythonCall.jl/src/C/consts.jl at 815c0de27a5a175411447dea250e82d52e40f7c5 · JuliaPy/PythonCall.jl · GitHub with any missing fields. And for good measure (and future-proofing) we should add padding at the end of this struct and review the other structs in that file.
Ok, I’m now more certain of this hypothesis: we are missing the final two fields from the Python type struct here: cpython/Include/cpython/object.h at 9363703bd3bf86e363c14a02e3d729caf1e29f44 · python/cpython · GitHub
And specifically, `tp_versions_used`, which caused the bug above, is the final (missing) field. It must have been added in Python 3.13.
Can you determine the version of Python at compile-time? Version-dependent structs are annoying to deal with, but if you know the version at compile-time you can gate the struct definition on the version.
No, but we can add a bunch of zero padding at the end for future-proofing. All uninitialised fields in the type struct are zero, so this is safe.
Ok, I just looked in the history: `tp_versions_used` was indeed added in 3.13, and the other missing field, `tp_watched`, was added in 3.12.
That explains why we had seen this occasionally on 3.12 but much more often on 3.13: the newer struct reaches further beyond the end of our definition, so writes are more likely to cause memory corruption.
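Given that `tp_watched` arrived in 3.12 and `tp_versions_used` in 3.13, the version gating asked about above might look like the following sketch (in Python's `ctypes` for illustration; the real mirror lives in Julia). The field types follow CPython's `object.h`, where `tp_watched` is an `unsigned char` and `tp_versions_used` a `uint16_t`; everything else here is hypothetical.

```python
import sys
import ctypes

# Tail of a hypothetical PyTypeObject mirror, gated on the running version.
# (Only the version-dependent fields are shown; the dozens of fields that
# precede them in the real struct are elided.)
tail_fields = []
if sys.version_info >= (3, 12):
    tail_fields.append(("tp_watched", ctypes.c_ubyte))         # added in 3.12
if sys.version_info >= (3, 13):
    tail_fields.append(("tp_versions_used", ctypes.c_uint16))  # added in 3.13

# Zero padding at the end guards against fields added by future versions,
# since uninitialised fields in a static type struct are zero anyway.
tail_fields.append(("_future_padding", ctypes.c_ubyte * 64))

class PyTypeObjectTail(ctypes.Structure):
    _fields_ = tail_fields
```

The same pattern translates to Julia by branching on the interpreter version detected at load time.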
We can inspect the size of a type object at runtime (e.g. `type.__basicsize__`), so we can check this at startup and raise an error if it’s out of bounds, which will prevent future nastiness.
Massive thank you to the amazing heroes of this thread!
Seems like a huge amount of work was invested into this over the past couple of months by @vchuravy @cjdoris @mkitti, so I can’t thank you enough. Great to have identified this!
Great idea, I was just about to ask whether there’s a way to add a regression test for this. Is there any way to check field types match also? (In case they ever decide to remove a field, say)
```julia
julia> sizeof(PythonCall.C.PyTypeObject)
408

julia> pybuiltins.type.__basicsize__
Python: 920
```
Unfortunately my idea is not very useful: it seems `__basicsize__` is a lot larger than the actual size of the struct.
Aha!
```julia
julia> pybuiltins.type.__sizeof__(pybuiltins.type)
Python: 416
```
(And the extra 8 bytes are exactly the ones missing from `PythonCall.C.PyTypeObject`.)
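For the record, the gap between the two numbers makes sense: `type.__basicsize__` is the size of *instances* of `type`, i.e. of classes, which CPython lays out as the larger `PyHeapTypeObject`, while `type.__sizeof__` special-cases static (non-heap) types such as `type` itself and reports plain `sizeof(PyTypeObject)`. A quick check one can run (exact numbers are version-dependent, so only the relationships are asserted):

```python
# __basicsize__ of `type` is the instance size of classes, i.e.
# sizeof(PyHeapTypeObject): every class statement creates a heap type.
heap_size = type.__basicsize__

# __sizeof__ special-cases static (non-heap) types and reports
# sizeof(PyTypeObject) for them; `type` itself is a static type.
static_size = type.__sizeof__(type)

# A heap type reports at least the full PyHeapTypeObject size.
class HeapExample:
    pass

assert static_size < heap_size
assert type.__sizeof__(HeapExample) >= heap_size
```

So `type.__sizeof__(type)` is the right quantity to compare the mirrored struct against.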
Sweet. I guess the idea is to have regression tests (or even runtime checks??) against the underlying struct?
By the way, any guesses why Linux is so much more resilient against this? Is Linux simply better at recovering from invalid casts? Or was it unable to detect such memory access errors? (The one in @vchuravy’s stack trace)
Thank you for figuring this out! Great work!
(I submitted the cvxpygen issue, so I was following along.)
Yeah, a runtime check at `__init__` to error out if the actual struct is larger than expected.
If we have padding, then the runtime check can verify that the padded size is large enough. And we can have a test in the test suite that the unpadded size is large enough, since a failure there signals that we are missing some fields.
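The two checks could be sketched like this in Python (function name and structure are illustrative; in PythonCall the mirror size would come from `sizeof` on the Julia struct):

```python
def check_pytypeobject_mirror(mirror_size, padding=64):
    """Compare our mirrored struct size against the interpreter's own
    idea of sizeof(PyTypeObject), as reported by __sizeof__ for a
    static type."""
    actual = type.__sizeof__(type)  # sizeof(PyTypeObject) in this interpreter
    # Runtime check: the padded mirror must cover the real struct, otherwise
    # writes by the interpreter land outside our allocation.
    if actual > mirror_size + padding:
        raise RuntimeError(
            f"PyTypeObject is {actual} bytes but the mirror only covers "
            f"{mirror_size + padding}; the mirror definition needs updating."
        )
    # Test-suite check: if the *unpadded* size is already too small we are
    # missing fields, even though the padding keeps things safe for now.
    return actual <= mirror_size

# Hypothetical usage at module init time (passing the true size here, so
# this trivially passes; PythonCall would pass its own mirror's size):
check_pytypeobject_mirror(mirror_size=type.__sizeof__(type))
```

The runtime branch raises hard, while the boolean return gives the test suite an early warning before the padding is exhausted.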
No, this is surely to do with particulars of alignment. I suspect you were largely getting lucky (or unlucky) that the old structure was 408 bytes, and so subsequent 16-byte alignments would skip those last 8 bytes. Julia’s GC-allocated objects may be 4, 8, or 16-byte aligned.
So if that were the only way objects were subsequently allocated, you would’ve never observed this on a 64 bit system. The corruption Valentin caught was in the constant data of the package image itself, which would’ve been dependent upon the particulars of how the package image was serialized. In other words, it was this constant corrupting the adjacent `_pyjlbase_name`. And I don’t know how often that particular PyType might have its version incremented.
Windows also runs a weak form of ASAN all the time, so it does actually detect out-of-bounds accesses more often. On Linux you need to use a build with ASAN, which is doubly complicated here since you need everything to be compiled with ASAN, including Python’s runtime. In my testing I didn’t compile Python with ASAN support, only Julia, so the illegal memory write wasn’t caught by ASAN.
In the end, Julia’s GC debug mode (`WITH_GC_DEBUG_ENV=1`) was crucial, since it performs more sanity checks; but even then the memory write could have gone into some location that isn’t as critical.
I hope that my debug logs also show why I and others like rr so much: the cause and effect are separated. Julia notices the corruption during GC, but the errant memory write happened much earlier, during the initialization of the library.
I would recommend @cjdoris to perhaps set up a CI job where they compile Julia with:

```make
# make the GC use regular malloc/frees, which are hooked by ASAN
override WITH_GC_DEBUG_ENV=1
# default to a debug build for better line number reporting
override JULIA_BUILD_MODE=debug
# Enable Julia assertions and LLVM assertions
FORCE_ASSERTIONS=1
LLVM_ASSERTIONS=1
```
This will catch data corruption perhaps a bit more consistently. PythonCall is a rather tricky piece of code, since it requires thinking about the implementation of two managed languages.
Hi, CPython maintainer here. What can we do to make this easier on Julia?
Currently, we expose the size of the `PyTypeObject` structure in the debug offsets. (Discourse won’t let me post links, but check out the `_Py_DebugOffsets` structure in the `pycore_debug_offsets.h` file.) Is it possible for Julia to check the debug offsets to ensure that it’s mimicking the type object structure correctly?
Alternatively, it’d be nice if Julia didn’t have to use the `PyTypeObject` structure at all. I’m assuming you guys are using it to define native types? If so, I think it’d be a lot easier to switch to heap types (`PyType_Spec`), which use an array of slots instead of structure fields to avoid ABI compatibility issues.
Yeah, as I said on the linked PR there, I’m not fully convinced we should be documenting fields like `tp_versions_used`. It’s an implementation detail used solely as an optimization by the interpreter, and it seems like people only want to know about it so they can mimic the type object structure (in which case, you should just use heap types).
Static types have a number of downsides:
- They’re implicitly immortal (see PEP 683).
- They have to be immutable for thread-safety reasons (see `Py_TPFLAGS_IMMUTABLETYPE`).
- They cannot be used with multiple inheritance.
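The last two points are easy to see from Python itself, using the three-argument `type()` call as the pure-Python analogue of `PyType_FromSpec` (both create heap types without needing to know `PyTypeObject`'s layout):

```python
# Heap type: created dynamically, no knowledge of PyTypeObject's layout needed.
Greeter = type("Greeter", (object,), {"greet": lambda self: "hi"})
assert Greeter().greet() == "hi"

# Heap types are mutable...
Greeter.extra = 42
assert Greeter.extra == 42

# ...while static types like `int` are immutable (Py_TPFLAGS_IMMUTABLETYPE):
try:
    int.extra = 42
    mutated = True
except TypeError:
    mutated = False
assert not mutated

# Heap types also support multiple inheritance:
class Mixin:
    def tag(self):
        return "mixin"

Combined = type("Combined", (Greeter, Mixin), {})
assert Combined().tag() == "mixin"
```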
Ah amazing, yes we’ll definitely switch to that. AFAIR we only use the structure to create our own types, so we can hopefully remove the definition (and related types) entirely.
Just spotted this comment; I’ll make an issue to add this to our CI, cheers.