Schema evolution: Structs considered harmful, in at least some cases? Non-semver version numbers as an example

I proposed allowing non-semver, as an option, or at least for debate:

It got me thinking about how you would implement this (and keep “full” compatibility), and I have a dilemma.

julia> v"∞"  # rarely used option; I'm not sure it's documented, and I think it only comes up in obscure cases, i.e. not for most users
v"∞"

julia> dump(ans)
VersionNumber
  major: UInt32 0xffffffff
  minor: UInt32 0xffffffff
  patch: UInt32 0xffffffff
  prerelease: Tuple{} ()
  build: Tuple{String}
    1: String ""


struct VersionNumber
    epoc::UInt8  # plausibly proposed new member...
    major::VInt
    minor::VInt
    patch::VInt
    prerelease::VerTuple
    build::VerTuple
end

At a minimum, version numbers are stored as 12 bytes (3x larger than needed), plus some overhead for the optional parts, which are almost never needed, even when they are not used. As you can see, the format (sometimes) includes a String, which currently lives on the heap, so it requires allocation and implies GC activity. Heap objects probably take 64+ bytes there, plus the pointer to them (8 bytes on 64-bit CPUs, i.e. usually).

But my main point is this: if I add that 1 byte for epoc (more with padding, I believe, depending on where the field is placed), what happens when you serialize an object? It’s a different, larger struct, i.e. a new struct in newer versions of Julia.
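To make the padding point concrete, here is a small sketch using made-up stand-in structs (Plain, EpocFirst and EpocLast are hypothetical names, and they only model the three numeric fields, not the two tuple fields of the real struct). It just asks Julia for the size and field offsets:

```julia
# Hypothetical stand-ins for the fixed-size numeric part of the version struct.
# The real struct also has `prerelease` and `build`, which are left out here.
struct Plain            # the three numeric fields as they are today
    major::UInt32
    minor::UInt32
    patch::UInt32
end

struct EpocFirst        # extra byte placed first
    epoc::UInt8
    major::UInt32
    minor::UInt32
    patch::UInt32
end

struct EpocLast         # extra byte placed last
    major::UInt32
    minor::UInt32
    patch::UInt32
    epoc::UInt8
end

# Print the total size and the byte offset of every field.
for T in (Plain, EpocFirst, EpocLast)
    offsets = [Int(fieldoffset(T, i)) for i in 1:fieldcount(T)]
    println(rpad(string(T), 10), " sizeof = ", sizeof(T), ", field offsets = ", offsets)
end
```

In these reduced stand-ins the extra byte costs 4 bytes in either position, because the struct size is rounded up to the 4-byte alignment of its UInt32 fields. The real struct, with its extra fields, may behave differently.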

You want to be able to deserialize all objects: the new struct in new Julia (or any updated struct in a program, even on the same Julia version), and, in some or most cases, the old struct from binary serialized formats made with older Julia. The new binary format then needs to store the Julia version, and old Julia versions can’t read the format written by newer Julia, even if epoc is always 0 (the default) in the file.
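A minimal sketch of that situation, assuming a made-up binary format with a one-byte layout tag in front of the fields (Ver2, LAYOUT_OLD, LAYOUT_NEW, write_ver and read_ver are all hypothetical names, and this is not how Julia’s own Serialization works):

```julia
# Hypothetical target struct: the numeric fields plus the proposed extra byte.
struct Ver2
    epoc::UInt8
    major::UInt32
    minor::UInt32
    patch::UInt32
end

# Made-up wire format: one leading byte names the layout that follows.
const LAYOUT_OLD = 0x01   # major, minor, patch
const LAYOUT_NEW = 0x02   # epoc, major, minor, patch

write_ver(io::IO, v::Ver2) = write(io, LAYOUT_NEW, v.epoc, v.major, v.minor, v.patch)

function read_ver(io::IO)
    layout = read(io, UInt8)
    if layout == LAYOUT_OLD
        # Old files carry no epoch, so fall back to the default of 0.
        return Ver2(0x00, read(io, UInt32), read(io, UInt32), read(io, UInt32))
    elseif layout == LAYOUT_NEW
        epoc = read(io, UInt8)
        return Ver2(epoc, read(io, UInt32), read(io, UInt32), read(io, UInt32))
    else
        error("unknown layout tag $layout; written by a newer program?")
    end
end

# A buffer as an old writer would have produced it (tag 0x01, then 1.10.4).
buf_old = IOBuffer()
write(buf_old, LAYOUT_OLD, UInt32(1), UInt32(10), UInt32(4))
v_old = read_ver(seekstart(buf_old))

# Round trip through the new layout.
buf_new = IOBuffer()
write_ver(buf_new, Ver2(0x01, 2024, 3, 0))
v_new = read_ver(seekstart(buf_new))
```

The new reader handles both layouts, but a reader that only knows tag 0x01 would land in the error branch for every new file, even when epoc is 0. That is the asymmetry described above.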

In many, or most(?), formats, e.g. Julia’s Pkg TOML format, the file is text-based (or variable-length in some other way that avoids the problem), and version numbers are stored as strings, as part of the larger text file.

And that’s fully ok. It’s not like version numbers are speed-critical…

Plan B for implementing the struct could be to keep the old one, with:

major = (UInt32(epoc) << 24) + major  # widen first: shifting a UInt8 left by 24 gives 0

We likely never needed a full 4-byte integer for the major version number. This limits its maximum to 16777215, which is hardly limiting, but strictly speaking it IS a breaking change, in case someone had ever made a larger version number and stored it in a (binary serialized) file…
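Plan B could be sketched like this. The helper names (pack_major, unpack_epoc, unpack_major) and constants are made up for illustration; nothing like this exists in Base:

```julia
# Hypothetical helpers for "Plan B": the top 8 bits of the existing 32-bit
# major field hold the epoch, the low 24 bits hold the real major number.
const MAJOR_BITS = 24
const MAJOR_MAX  = (UInt32(1) << MAJOR_BITS) - UInt32(1)   # 16777215

function pack_major(epoc::UInt8, major::Integer)
    0 <= major <= MAJOR_MAX || throw(ArgumentError("major must fit in 24 bits"))
    # Widen before shifting: a UInt8 shifted left by 24 would just give 0.
    return (UInt32(epoc) << MAJOR_BITS) | UInt32(major)
end

unpack_epoc(packed::UInt32)  = UInt8(packed >> MAJOR_BITS)
unpack_major(packed::UInt32) = packed & MAJOR_MAX

packed = pack_major(0x01, 2024)
println(unpack_epoc(packed), " ! ", unpack_major(packed))
```

With epoch 0 the packed value equals the plain major number, so existing stored values below 2^24 keep their meaning. A stored major above 16777215 would be misread as having a non-zero epoch, which is the breaking case mentioned above.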

My preferred solution would be (or would have been) a “text” format, even in memory: not some JSON or TOML file, just a plain version string. When it’s constructed it needs to be validated, and is then stored (or rejected). It’s likely fastest for anything needed (at least with my alternative string format, which wouldn’t use the heap in most cases; it’s not like the current format never uses the heap anyway).

The downside is that accessing components of the version number is slower, because you have to reparse every time rather than read from a known location. But it’s not certain that it’s slower, because the current struct format is more verbose than it needs to be. And because of the memory saving, I think the idea can work well for arbitrary structs (maybe not individual numbers, but nobody cares about single numbers).

Comparing version numbers would still be fast for equality or inequality checks. Ordering would maybe be slower, maybe not… Printing will be faster. Likely the most important operations will be faster, and none of them are speed-critical anyway. But extending the format is easier, e.g. adding the epoch prefix, some hypothetical suffix, or something else.
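A rough sketch of the text-backed idea, using a hypothetical TextVersion type. It stores an ordinary String (so it does use the heap, unlike the alternative string format mentioned above) and leans on the existing VersionNumber parser for validation:

```julia
# Hypothetical string-backed version type; `TextVersion` is not a real API.
struct TextVersion
    text::String
    function TextVersion(s::AbstractString)
        VersionNumber(s)          # validate once; throws ArgumentError if invalid
        return new(String(s))
    end
end

# Component access reparses on every call: the slow path.
major(v::TextVersion) = VersionNumber(v.text).major

# Equality is a plain string comparison. This assumes one canonical spelling
# is stored ("1.2" and "1.2.0" would compare as different here).
Base.:(==)(a::TextVersion, b::TextVersion) = a.text == b.text

# Ordering falls back to parsing both sides.
Base.isless(a::TextVersion, b::TextVersion) =
    VersionNumber(a.text) < VersionNumber(b.text)

# Printing is just writing the stored text.
Base.print(io::IO, v::TextVersion) = print(io, v.text)

a = TextVersion("1.10.4")
b = TextVersion("1.9.0")
println(b < a, " ", major(a), " ", a)
```

Ordering has to parse here because plain string comparison would put "1.9.0" after "1.10.4". Whether that parsing is actually slower than the current struct in practice is something I haven’t measured.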

Currently everything in Julia can be redefined at runtime except structs. There are hacks, including a package for redefining them in the same REPL session, so this is not just about changes across Julia versions or application versions.

If we have “extensible structs” as a concept, ALL of them can fit into 8 bytes (fixed-length, with variable-length data on the heap as an option), because I need a 64-bit pointer as the escape hatch, and with bit-stealing/tagged pointers I can use at least 53 bits for data. That’s basically my new hybrid string idea. But it could be extended to numbers and all structs. Floats such as Float64 are all implicitly structs, and most of the time you don’t need 64 bits for numbers. You can e.g. get away with 32 or 16 bits, even for floats, WITHOUT accuracy loss, as long as the value is exactly representable in the narrower type.
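The claim about narrower floats only holds for values that survive the round trip. A small sketch, with a made-up helper name (narrowest_exact), that checks this per value:

```julia
# Hypothetical helper: the narrowest standard float type that holds `x` exactly.
function narrowest_exact(x::Float64)
    for T in (Float16, Float32)
        # Converting down and back is lossless only if the value is unchanged.
        Float64(T(x)) == x && return T
    end
    return Float64
end

for x in (0.5, 100000.0, 0.1)
    println(x, " => ", narrowest_exact(x))
end
```

Here 0.5 fits in Float16, 100000.0 overflows Float16 but is exact in Float32, and 0.1 needs the full Float64 to keep its exact bit pattern. So a variable-width encoding along these lines would have to pick the width per value rather than per type.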

Have you looked at the following?

What’s the goal here? What is the specific outcome you want, as in “If we implement this issue it will improve Julia [your reasoning goes here]”?

Epochs were added to Python so projects which switch from one sort of versioning scheme to another could keep their packages in order, that’s the feature (some might call it a kludge) they provide. But Julia only has the major.minor.patch versioning to begin with, so an epoch isn’t helpful in isolation.

I have my quibbles with SemVer proper, but I think it’s great that Julia embraces one way of versioning packages, and uses it to pin dependencies. Dependency management in Python is hell. The proliferation of valid version strings is maybe not the most central reason for that, but it also doesn’t help.

I guess I don’t see creativity or freedom as very important for versions. You write software, you increase the version, people start yelling at you when it gets to 1.0 (I did mention that SemVer itself has some problems!), people use the version number to pin dependencies, life goes on.

So I consider the correct number of version-string variations for an ecosystem to be one.


I meant this post to be more on schema/struct evolution in general, and only version numbers as an example. But to answer:

Some projects want calendar versioning, like 2024.03, for a full app, or say a game. We do have a few games already written in Julia, in a package form.

I’m not saying calendar versioning is better (not always; maybe only in these few cases), nor that apps or games should be packages, but for anything like that, i.e. UI-heavy, semver doesn’t seem to apply. Maybe we shouldn’t force all packages to use semver. [All the dependencies of app/game packages, i.e. the actual libraries, could still use semver.]

I just got curious why Python allows a vast array of different types of version numbers, and IF we could support some of them. I find it likely that the Python ecosystem was chaos, predates semver, and that many or most projects are migrating to it by now. I wasn’t aware of (or didn’t remember) VersionParsing.jl, and most extensions should live there or elsewhere. But currently, at least, it only allows some of them (or less restrictive punctuation, e.g. mapping commas to periods), and maps them all to the version struct in Base, so it couldn’t support epochs either.

Only if we want to allow non-semver for packages/Pkg do we want to change Base, and probably add epochs. I’m not even sure we should; I was just looking into how, and ran into this problem, which is more general.


Reasonable answer, thanks.

Not a problem:

julia> v"2024.03"
v"2024.3.0"

I’m trying to distinguish between “versioning as the Julia runtime sees it” and “semantic versioning”, since the latter is more of a social contract than an implementation. It’s true that this means that once a package adopts CalVer it’s “locked in” to continuing with it, but I consider that a feature, if anything. Making everything more complicated (and therefore at least potentially buggy) so that people can change their minds about versioning, of all things, doesn’t seem like a useful feature to have.

I’m only replying to the part of your post about the associated issue because I don’t entirely understand the rest of it, and probably don’t have anything useful to add. But not changing how Julia does versioning would mean that in that specific application, the issues you’re sketching out won’t arise.

You’re probably right! We could use CalVer for a package already, and it’s not too bad if every year it looks like a major version jump, at least for apps and games, i.e. packages nobody is going to depend on.

About @mkitti’s comment:

Could we just map epoch to the major version number?

E!X.Y just becomes E.X.Y

Maybe. I actually thought we might need E!X.Y.Z, but the epoch was maybe only meant to solve the CalVer problem (which I now realize doesn’t need/use Z), and only when you change your mind and adopt semver, which I guess happens in Python land. But we start from semver, and may never have that problem. Going from semver to CalVer doesn’t actually seem to be much of a problem…

Does Pkg have to know whether semver is followed or not? How would you signal it? Do we want to read something into the major version being 4-digit, e.g. 2024?

About “I don’t entirely understand” struct evolution: in short, redefining a struct, or just adding anything to one, such as the field below, is currently not possible (in the same REPL session, nor across sessions, or at least it has a potential problem):

struct VersionNumber
    epoc::UInt8  # plausibly proposed new member...
..

You plausibly want to be able to deserialize the old struct into the new one.

Python version specifiers are woefully complex, to the point where the standard says “some of the versioning practices which are technically permitted by the specification are strongly discouraged for new projects.”

Switching from CalVer to SemVer is specifically cited as a reason to use epochs.

I’m just guessing here, but I assume that Pkg doesn’t care at all, it would be strange if date-appearing major versions were special-cased in any way. The General Registry might.

Julia structs are C structs, right down to the padding due to Julia (like C, unlike Rust) always laying fields out in memory as specified. This is a bedrock foundation of the language.

I think the difficulty you’re exploring here is best solved with serializers, rather than anything which would change the nature of concrete struct types in the language itself. Protobufs exist in large part in order to solve the mismatch between the concrete nature of structs in C/C++, and the fact that protocols need to be able to change at a cadence which is independent of systems reading and writing that protocol.

So a serialization strategy based around protobufs might be a good place to start, in coming up with a serialization format which allows deserializing an old struct into a new one.
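As a rough illustration of why tagged fields help, here is a toy tag/length/value format. It is made up for this post and is not protobuf’s actual wire format; the tag constants, write_field and read_fields are all hypothetical names:

```julia
# Made-up wire format in the spirit of protobuf, not protobuf itself:
# each field is written as (tag::UInt8, length::UInt8, little-endian payload).
const TAG_MAJOR = 0x01
const TAG_MINOR = 0x02
const TAG_PATCH = 0x03
const TAG_EPOC  = 0x04

write_field(io::IO, tag::UInt8, value::Union{UInt8,UInt32}) =
    write(io, tag, UInt8(sizeof(value)), htol(value))

# A reader only knows the tags in `known`: unknown fields are skipped by
# length, and known fields that are missing keep their default of 0.
function read_fields(io::IO, known::Dict{UInt8,Symbol})
    out = Dict{Symbol,UInt32}(name => UInt32(0) for name in values(known))
    while !eof(io)
        tag = read(io, UInt8)
        len = read(io, UInt8)
        payload = read(io, Int(len))       # the `len` payload bytes
        haskey(known, tag) || continue     # field from a newer writer: ignore
        # Rebuild the little-endian integer from its bytes.
        out[known[tag]] = foldr((byte, acc) -> (acc << 8) | byte, payload; init = UInt32(0))
    end
    return out
end

# A "new" writer emits all four fields.
buf = IOBuffer()
write_field(buf, TAG_EPOC,  0x01)
write_field(buf, TAG_MAJOR, UInt32(2024))
write_field(buf, TAG_MINOR, UInt32(3))
write_field(buf, TAG_PATCH, UInt32(0))

old_reader = Dict(TAG_MAJOR => :major, TAG_MINOR => :minor, TAG_PATCH => :patch)
new_reader = merge(old_reader, Dict(TAG_EPOC => :epoc))

seen_by_old = read_fields(seekstart(buf), old_reader)
seen_by_new = read_fields(seekstart(buf), new_reader)
```

The old reader skips the epoch field it doesn’t know about, and a new reader given old data would simply see epoc as 0. Whether silently ignoring an epoch is acceptable for version ordering is a separate design question.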
