Added in GH-113462: Limit the number of versions that a single class can use. by markshannon · Pull Request #114900 · python/cpython · GitHub, which first appeared in v3.13.0a4 (and apparently it’s undocumented to date)
Ah, `PyTypeObject` has a mirror definition in PythonCall.jl/src/C/consts.jl at 815c0de27a5a175411447dea250e82d52e40f7c5 · JuliaPy/PythonCall.jl · GitHub, and then raw pointers to that struct are handed around.
Oooh, I have a hypothesis: I think the C struct definitions in PythonCall are out of date. Specifically, I think they might be too short.
`tp->tp_versions_used++` is Python modifying the fields of a Python type, probably while initialising one of the Python types defined in PythonCall. If the memory allocated for the type is too short (because the corresponding struct in PythonCall is wrong), then this increment could be modifying something else entirely, such as the type tag of a Julia object.
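The failure mode can be simulated in pure Python with `ctypes`. This is an illustrative sketch, not the real `PyTypeObject` layout: the struct and field names are stand-ins, but the mechanism (an allocation sized by an out-of-date mirror, written through the true layout) is the same.

```python
import ctypes

# An out-of-date mirror of some C struct: it is missing the newest field,
# so any allocation sized from it is 2 bytes too short.
class ShortMirror(ctypes.Structure):
    _fields_ = [("tp_name", ctypes.c_char_p)]

# The real layout, with the field the mirror is missing.
class FullStruct(ctypes.Structure):
    _fields_ = [("tp_name", ctypes.c_char_p),
                ("tp_versions_used", ctypes.c_uint16)]

# Allocate using the (too small) mirror's size, plus 2 sentinel bytes that
# stand in for whatever object happens to live next in memory.
buf = ctypes.create_string_buffer(ctypes.sizeof(ShortMirror) + 2)
assert buf.raw[ctypes.sizeof(ShortMirror):] == b"\x00\x00"  # sentinel untouched

# The interpreter, which knows the full layout, increments the final field...
full = ctypes.cast(buf, ctypes.POINTER(FullStruct)).contents
full.tp_versions_used += 1

# ...and the bytes beyond the mirror's size have been silently modified.
assert buf.raw[ctypes.sizeof(ShortMirror):] != b"\x00\x00"
```

In the real bug the "sentinel" is whatever Julia placed after the undersized type allocation, which is why the symptom showed up as corruption of unrelated Julia objects.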
If I’m right, we just need to update this struct PythonCall.jl/src/C/consts.jl at 815c0de27a5a175411447dea250e82d52e40f7c5 · JuliaPy/PythonCall.jl · GitHub with any missing fields. And for good measure (and future-proofing) we should add padding at the end of this struct and review the other structs in that file.
Ok, I’m now more certain of this hypothesis: we are missing the final two fields from the Python type struct here: cpython/Include/cpython/object.h at 9363703bd3bf86e363c14a02e3d729caf1e29f44 · python/cpython · GitHub
And specifically, `tp_versions_used`, which caused the bug above, is the final (missing) field. It must have been added in Python 3.13.
Can you determine the version of Python at compile-time? Version-dependent structs are annoying to deal with, but if you know the version at compile-time you can gate the struct definition on the version.
No, but we can add a bunch of zero padding at the end for future-proofing. All uninitialised fields in the type struct are zero, so this is safe.
Ok, I just looked in the history: `tp_versions_used` was indeed added in 3.13, and the other missing field, `tp_watched`, was added in 3.12.
That explains why we had seen this occasionally on 3.12 but much more often on 3.13: the newer struct reaches further beyond the end of our definition, so writes are more likely to cause memory corruption.
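Given that `tp_watched` arrived in 3.12 and `tp_versions_used` in 3.13, the version gating asked about above might look like the following sketch (in Python's `ctypes` for illustration; the real mirror lives in Julia). The field types follow CPython's `object.h`, where `tp_watched` is an `unsigned char` and `tp_versions_used` a `uint16_t`; everything else here is hypothetical.

```python
import sys
import ctypes

# Tail of a hypothetical PyTypeObject mirror, gated on the running version.
# (Only the version-dependent fields are shown; the dozens of fields that
# precede them in the real struct are elided.)
tail_fields = []
if sys.version_info >= (3, 12):
    tail_fields.append(("tp_watched", ctypes.c_ubyte))         # added in 3.12
if sys.version_info >= (3, 13):
    tail_fields.append(("tp_versions_used", ctypes.c_uint16))  # added in 3.13

# Zero padding at the end guards against fields added by future versions,
# since uninitialised fields in a static type struct are zero anyway.
tail_fields.append(("_future_padding", ctypes.c_ubyte * 64))

class PyTypeObjectTail(ctypes.Structure):
    _fields_ = tail_fields
```

The same pattern translates to Julia by branching on the interpreter version detected at load time.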
We can inspect the size of a type object at runtime (e.g. `type.__basicsize__`), so we can check this at startup and raise an error if it’s out of bounds, which will prevent future nastiness.
Massive thank you to the amazing heroes of this thread!
Seems like a huge amount of work was invested into this over the past couple of months by @vchuravy @cjdoris @mkitti, so I can’t thank you enough. Great to have identified this!
Great idea, I was just about to ask whether there’s a way to add a regression test for this. Is there any way to check field types match also? (In case they ever decide to remove a field, say)
```julia
julia> sizeof(PythonCall.C.PyTypeObject)
408

julia> pybuiltins.type.__basicsize__
Python: 920
```
Unfortunately my idea is not very useful: it seems `__basicsize__` is a lot larger than the actual size of the struct.
Aha!
```julia
julia> pybuiltins.type.__sizeof__(pybuiltins.type)
Python: 416
```
(And the extra 8 bytes are exactly the ones missing from `PythonCall.C.PyTypeObject`.)
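For the record, the gap between the two numbers makes sense: `type.__basicsize__` is the size of *instances* of `type`, i.e. of classes, which CPython lays out as the larger `PyHeapTypeObject`, while `type.__sizeof__` special-cases static (non-heap) types such as `type` itself and reports plain `sizeof(PyTypeObject)`. A quick check one can run (exact numbers are version-dependent, so only the relationships are asserted):

```python
# __basicsize__ of `type` is the instance size of classes, i.e.
# sizeof(PyHeapTypeObject): every class statement creates a heap type.
heap_size = type.__basicsize__

# __sizeof__ special-cases static (non-heap) types and reports
# sizeof(PyTypeObject) for them; `type` itself is a static type.
static_size = type.__sizeof__(type)

# A heap type reports at least the full PyHeapTypeObject size.
class HeapExample:
    pass

assert static_size < heap_size
assert type.__sizeof__(HeapExample) >= heap_size
```

So `type.__sizeof__(type)` is the right quantity to compare the mirrored struct against.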
Sweet. I guess the idea is to have regression tests (or even runtime checks??) against the underlying struct?
By the way, any guesses why Linux is so much more resilient against this? Is Linux simply better at recovering from invalid casts? Or was it unable to detect such memory access errors? (The one in @vchuravy’s stack trace)
Thank you for figuring this out! Great work!
(I submitted the cvxpygen issue, so I was following along.)
Yeah, a runtime check at `__init__` to error out if the actual struct is larger than expected.
If we have padding, then the runtime check can verify that the padded size is large enough. And we can have a test in the test suite that the unpadded size is large enough, since a failure there signals that we are missing some fields.
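The two checks could be sketched like this in Python (function name and structure are illustrative; in PythonCall the mirror size would come from `sizeof` on the Julia struct):

```python
def check_pytypeobject_mirror(mirror_size, padding=64):
    """Compare our mirrored struct size against the interpreter's own
    idea of sizeof(PyTypeObject), as reported by __sizeof__ for a
    static type."""
    actual = type.__sizeof__(type)  # sizeof(PyTypeObject) in this interpreter
    # Runtime check: the padded mirror must cover the real struct, otherwise
    # writes by the interpreter land outside our allocation.
    if actual > mirror_size + padding:
        raise RuntimeError(
            f"PyTypeObject is {actual} bytes but the mirror only covers "
            f"{mirror_size + padding}; the mirror definition needs updating."
        )
    # Test-suite check: if the *unpadded* size is already too small we are
    # missing fields, even though the padding keeps things safe for now.
    return actual <= mirror_size

# Hypothetical usage at module init time (passing the true size here, so
# this trivially passes; PythonCall would pass its own mirror's size):
check_pytypeobject_mirror(mirror_size=type.__sizeof__(type))
```

The runtime branch raises hard, while the boolean return gives the test suite an early warning before the padding is exhausted.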
No, this is surely to do with particulars of alignment. I suspect you were largely getting lucky (or unlucky) that the old structure was 408 bytes, and so subsequent 16-byte alignments would skip those last 8 bytes. Julia’s GC-allocated objects may be 4, 8, or 16-byte aligned.
So if that were the only way objects were subsequently allocated, you would’ve never observed this on a 64 bit system. The corruption Valentin caught was in the constant data of the package image itself, which would’ve been dependent upon the particulars of how the package image was serialized. In other words, it was this constant corrupting the adjacent `_pyjlbase_name`. And I don’t know how often that particular PyType might have its version incremented.
Windows also runs a weak form of ASAN all the time, so it does actually detect out-of-bounds accesses more often. On Linux you need to use a build with ASAN, which is doubly complicated here since you need everything to be compiled with ASAN, including Python’s runtime. In my testing I didn’t compile Python with ASAN support, only Julia, so the illegal memory write wasn’t caught by ASAN.
In the end, Julia’s GC debug mode (`WITH_GC_DEBUG_ENV=1`) was crucial, since it performs more sanity checks; but even then the memory write could have gone into some location that isn’t as critical.
I hope that my debug logs also show why I and others like rr so much: the cause and effect are separated. Julia notices the corruption during GC, but the errant memory write happened much earlier, during the initialization of the library.
I would recommend @cjdoris to perhaps set up a CI job where they compile Julia with:

```make
# make the GC use regular malloc/frees, which are hooked by ASAN
override WITH_GC_DEBUG_ENV=1
# default to a debug build for better line number reporting
override JULIA_BUILD_MODE=debug
# Enable Julia assertions and LLVM assertions
FORCE_ASSERTIONS=1
LLVM_ASSERTIONS=1
```
This will catch data corruption perhaps a bit more consistently. PythonCall is a rather tricky piece of code, since it requires thinking about the implementation of two managed languages.
Hi, CPython maintainer here. What can we do to make this easier on Julia?
Currently, we expose the size of the `PyTypeObject` structure in the debug offsets. (Discourse won’t let me post links, but check out the `_Py_DebugOffsets` structure in the `pycore_debug_offsets.h` file.) Is it possible for Julia to check the debug offsets to ensure that it’s mimicking the type object structure correctly?
Alternatively, it’d be nice if Julia didn’t have to use the `PyTypeObject` structure at all. I’m assuming you guys are using it to define native types? If so, I think it’d be a lot easier to switch to heap types (`PyType_Spec`), which use an array of slots instead of structure fields to avoid ABI compatibility issues.
Yeah, as I said on the linked PR there, I’m not fully convinced we should be documenting fields like `tp_versions_used`. It’s an implementation detail used solely as an optimization by the interpreter, and it seems like people only want to know about it so they can mimic the type object structure (in which case, you should just use heap types).
Static types have a number of downsides:
- They’re implicitly immortal (see PEP 683).
- They have to be immutable for thread-safety reasons (see `Py_TPFLAGS_IMMUTABLETYPE`).
- They cannot be used with multiple inheritance.
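The last two points are easy to see from Python itself, using the three-argument `type()` call as the pure-Python analogue of `PyType_FromSpec` (both create heap types without needing to know `PyTypeObject`'s layout):

```python
# Heap type: created dynamically, no knowledge of PyTypeObject's layout needed.
Greeter = type("Greeter", (object,), {"greet": lambda self: "hi"})
assert Greeter().greet() == "hi"

# Heap types are mutable...
Greeter.extra = 42
assert Greeter.extra == 42

# ...while static types like `int` are immutable (Py_TPFLAGS_IMMUTABLETYPE):
try:
    int.extra = 42
    mutated = True
except TypeError:
    mutated = False
assert not mutated

# Heap types also support multiple inheritance:
class Mixin:
    def tag(self):
        return "mixin"

Combined = type("Combined", (Greeter, Mixin), {})
assert Combined().tag() == "mixin"
```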
Ah amazing, yes we’ll definitely switch to that. AFAIR we only use the structure to create our own types, so we can hopefully remove the definition (and related types) entirely.
Just spotted this comment; I’ll make an issue to add this to our CI, cheers.