What is the unique "identifier" in a package's path used for?

I’m still confused about a package’s directory structure. What is the unique “version” identifier included in the path used for, as in “1GXzF” below? We already include a UUID and version in Project.toml / Manifest.toml, and the “1GXzF” doesn’t seem to be referenced anywhere else. Why don’t we just use the version specified in the TOML files? How does this fit into the process?

.julia/packages/Zygote/1GXzF/src/flux.jl

It is a hash of the UUID and the content hash (SHA1) of the package source code. It is necessary to support multiple installations of the same package, and multiple packages with the same name.
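
As a minimal sketch of that computation: the internal, unexported helper `Base.version_slug` (the same one used in the registry check later in this thread) takes a package UUID and a tree hash and returns the slug. The UUID below is assumed to be Zygote's, and the tree hash is made up for illustration, so the resulting slug will not be 1GXzF.

```julia
using UUIDs

# Assumed to be Zygote's UUID; the tree hash is a made-up placeholder.
uuid = UUID("e88e6eb3-aa80-5325-afca-941959d7151f")
tree = Base.SHA1("0123456789abcdef0123456789abcdef01234567")

# Internal helper, not public API: it may change between Julia versions.
slug = Base.version_slug(uuid, tree)

# The slug becomes the directory name under packages/<Name>/.
println(joinpath(".julia", "packages", "Zygote", slug))
```

The same UUID and tree hash always give the same slug, which is why a snapshot that is already on disk does not need to be downloaded again.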

Imagine you have two projects, one using version A of Zygote and the other using version B. If you only ever have one version on your computer, you’d have to repeatedly redownload the correct version if you switch between the two projects. By having both available, you only have to download a given version once. The 1GXzF serves as the part identifying the different versions on disk, as explained by @fredrikekre.

I get that, but why use a randomized identifier? Why not use the version in the Project.toml, which is a lot more informative when you’re trying to find a version?

I would even understand if this was somehow related to the GitHub SHA, where you may have multiple checkpoints per version, but this just doesn’t seem to serve any purpose.

Because you can install different “versions” of a package that have the same Project.toml version (because you can install packages by tag or branch or commit sha, not just by version).

Though you bring up a good point. Instead of 1GXzF, why not do v0.12.6_1GXzF or something similar?

ahh — ok, got it… that makes sense.

Thanks
(although, I’m guessing the uuid in the Project.toml will be different… but I get that)

Good question. Here are a couple of reasons why using version numbers doesn’t work so well:

  • You can install and use snapshots of packages that don’t have any version number assigned to them, so what version number would you use for those?
  • The same snapshot of a package can be assigned multiple different version numbers—e.g. the last release candidate of Julia becomes the release. Should one have to install such versions multiple times with identical trees because they’re referred to by different names?
  • You might have installed some snapshot of a package before it was given some version number. Should the assignment of a version number after you were using it cause you to have to install it again even though you already have it because you want to put the version numbers in the path?

On the other hand, every package snapshot has a tree hash which, for all practical purposes, is unique (barring SHA1 collisions, which are astronomically unlikely to happen by accident). So, why not use the full unique tree hash instead of a five-character “slug”, you may ask? Because hashes are long:

  1. There exist file systems with path length limits that can cause problems if you stick a 40-character hash in every path—this is a real problem that I hit in practice when originally developing this;
  2. You have to see those slugs in paths, and it’s not uncommon to see source file paths in editors and elsewhere, so having them not be too distractingly long is helpful.

Doesn’t the shortness of the slug destroy the uniqueness? Yes, but it’s “scoped” within a package name, so there would have to be two different package snapshots with the same name having a slug collision. Since there are 62^5 possible slugs, you’d need to install more than 35k different snapshots with the same package name before you had a 50% chance of having a slug collision. If you’re on a case-insensitive file system, the number is somewhat smaller: you’d need to install more than 9k different snapshots in order to have a 50% chance of a collision. That seems unlikely enough to not worry about. Even if we get to the point where there are 9,000 versions of some package, it seems unlikely that anyone would have all of them installed at the same time.
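
The 35k and 9k figures can be reproduced with the usual birthday-problem approximation. This is only a sketch, and it assumes slugs are uniformly distributed over five characters drawn from A-Z, a-z and 0-9; the function name `snapshots_for` is made up for this example.

```julia
# Number of equally likely slugs: 5 characters from [A-Za-z0-9].
n_slugs = 62^5
# On a case-insensitive file system upper and lower case coincide.
n_slugs_ci = 36^5

# Birthday-problem approximation: the number of snapshots k at which the
# collision probability reaches p satisfies k ≈ sqrt(2 N log(1 / (1 - p))).
snapshots_for(p, N) = sqrt(2 * N * log(1 / (1 - p)))

println(round(Int, snapshots_for(0.5, n_slugs)))     # a bit over 35k
println(round(Int, snapshots_for(0.5, n_slugs_ci)))  # a bit over 9k
```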

I went to check how many releases packages have. Right now even the most active packages only have 100-200 releases. In general one can probably expect the pace to decrease once a stable release is hit; for example, pandas and numpy each have fewer than 200 releases. So yeah, having O(10k) different versions is unlikely, not to mention actually having them all installed at the same time.

OTOH if there are 1000 packages with more than 300 versions, then probably one of them will have a clashing pair of slugs. Or 10,000 packages with more than 100 versions.

It doesn’t affect the main argument but I have to nitpick with some of the reasoning.

This is not true for packages, is it? The version number is written in stone in Project.toml inside the snapshot and won’t change for the same snapshot. Multiple snapshots may have the same version number though.

I think the version in Project.toml isn’t really relevant; the source of truth is always the registry, no? And nothing stops me from registering multiple versions pointing to the same treesha.

It can hardly be the source of truth for branches and snapshots which are not in the registry. A third potential source of truth is each manifest. Which is really the truth probably depends on which part of Julia you ask, but nothing good will come from having these out of sync.

That’s a question of definition. Registrator won’t let you, and neither will LocalRegistry, nor hopefully any other package that people use for registering. You can of course edit the registry to your liking, and in that sense it’s possible, but if you try pkg"add Package#treehash" Julia will look up the version from Project.toml regardless of what you write in the registry.

But you raise a good point. RegistryCI shouldn’t let such inconsistencies pass: Version number consistency · Issue #432 · JuliaRegistries/RegistryCI.jl · GitHub.

Don’t forget that the package name also has to match, as it’s part of the path in question. So in your example those 1,000 (10,000) packages would all have to have the same name.

No I accounted for that. To put what I said a different way, if a package has 300 versions, there’s roughly a 1/1000 chance it repeats a slug. So if there are 1000 such packages then there is a high chance one of them repeats a slug.
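
A rough way to check these numbers, assuming slugs are uniform and packages are independent (the function names `p_collision` and `p_any` are invented for this sketch). By this estimate the roughly 1/1000 per-package figure corresponds to the case-insensitive slug count of 36^5; with the full 62^5 slugs the per-package chance comes out closer to 1 in 20,000.

```julia
# Probability that k snapshots of one package produce at least one slug
# collision, assuming slugs are uniform over N values (exact product form).
p_collision(k, N) = 1 - prod(1 - i / N for i in 0:k-1)

# Probability that at least one of m independent packages has a collision.
p_any(m, k, N) = 1 - (1 - p_collision(k, N))^m

for (label, N) in (("case-sensitive", 62^5), ("case-insensitive", 36^5))
    println(label, ": one package, 300 versions: ", p_collision(300, N))
    println(label, ": 1000 such packages:        ", p_any(1000, 300, N))
end
```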

The very fact that there’s room for debate about this proves (to me) that version numbers are not good for this purpose. There are version numbers in project files and registries, and they normally agree, but they might not; they also might change even though they’re not supposed to. So I think the claim that using them to identify snapshots is a bad idea is borne out.

Here is a function to see if there are any such possible collisions in a given registry:

using TOML, UUIDs

# Report distinct tree hashes of the same package name that map to the same slug.
# Assumes the registry is unpacked as a directory at the given path.
function check_duplicate_slugs(registry::AbstractString=joinpath(homedir(), ".julia", "registries", "General"))
    reg = TOML.parsefile(joinpath(registry, "Registry.toml"))
    # package name => slug => versions whose tree hashes produce that slug
    d = Dict{String, Dict{String, Vector{VersionNumber}}}()
    for (uuid, pkg) in reg["packages"]
        git_tree_shas_pkg = Set{Base.SHA1}()
        p = joinpath(registry, pkg["path"])
        v = TOML.parsefile(joinpath(p, "Versions.toml"))
        for (version, version_data) in v
            git_tree_sha1 = Base.SHA1(version_data["git-tree-sha1"])
            # Identical snapshots trivially share a slug, so count each tree hash once.
            git_tree_sha1 in git_tree_shas_pkg && continue
            push!(git_tree_shas_pkg, git_tree_sha1)
            slug = Base.version_slug(UUID(uuid), git_tree_sha1)
            d_pkg = get!(valtype(d), d, pkg["name"])
            push!(get!(valtype(d_pkg), d_pkg, slug), VersionNumber(version))
        end
    end

    for (pkg, slug_info) in d
        for (slug, versions) in slug_info
            if length(versions) > 1
                @info "Slug collision for pkg $pkg with versions $versions"
            end
        end
    end

    return
end

check_duplicate_slugs()

Currently there are zero (assuming my implementation is correct).

Here are some examples of packages with different versions but the same snapshot:

DynamicalSystemsBase    versions [v"1.0.0", v"0.12.2"]
TriangleMesh            versions [v"1.0.2", v"1.0.1"]
TriangleMesh            versions [v"1.0.5", v"1.0.6"]
Turing                  versions [v"0.6.0", v"0.6.1"]
MDCT                    versions [v"1.1.1", v"1.1.2"]
Expokit                 versions [v"0.1.0", v"0.0.2"]
CMPFit                  versions [v"0.2.0", v"0.2.1"]
LaTeXStrings            versions [v"1.0.2", v"1.0.3"]
VersionParsing          versions [v"1.1.2", v"1.1.3"]
WebIO                   versions [v"0.7.0", v"0.4.2"]
ApproximateComputations versions [v"0.2.4", v"0.2.3"]
ImageMetadata           versions [v"0.5.1", v"0.4.2"]
Example                 versions [v"0.2.0", v"0.1.0", v"0.0.2"]
CUDA_jll                versions [v"10.2.89+2", v"11.0.2+0"]
LLVM_assert_jll         versions [v"11.0.0+2", v"11.0.0+3"]
LLVM_assert_jll         versions [v"11.0.0+5", v"11.0.0+6", v"11.0.0+4"]
Clang_jll               versions [v"11.0.0+5", v"11.0.0+6"]
DynamicalBilliards      versions [v"2.3.0", v"2.5.0"]
Clang_assert_jll        versions [v"11.0.0+2", v"11.0.0+3"]
Clang_assert_jll        versions [v"11.0.0+5", v"11.0.0+6", v"11.0.0+4"]
LLVM_full_assert_jll    versions [v"11.0.0+9", v"11.0.0+8"]
libLLVM_assert_jll      versions [v"11.0.0+5", v"11.0.0+6", v"11.0.0+4"]
libLLVM_assert_jll      versions [v"11.0.0+2", v"11.0.0+3"]
libLLVM_jll             versions [v"11.0.0+5", v"11.0.0+6"]
BinaryProvider          versions [v"0.5.4", v"0.5.6"]
PolaronMobility         versions [v"1.1.1", v"1.2.0"]

I don’t really know how they were put into the registry… Some from METADATA perhaps, some from manual edits.
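
The code used to produce that list was not posted. One way such a list could be produced, sketched here under the assumption that each package's Versions.toml has already been parsed into a dictionary, is to group versions by tree hash and keep the hashes that more than one version points to. The function name `shared_snapshots` and the toy data are made up for illustration.

```julia
using TOML

# Group the versions listed in a parsed Versions.toml by tree hash and
# keep only the hashes that more than one version points to.
function shared_snapshots(versions::AbstractDict)
    by_tree = Dict{String, Vector{VersionNumber}}()
    for (version, data) in versions
        tree = data["git-tree-sha1"]
        push!(get!(Vector{VersionNumber}, by_tree, tree), VersionNumber(version))
    end
    return filter(kv -> length(kv.second) > 1, by_tree)
end

# Toy Versions.toml content with made-up hashes (not real registry data).
toy = TOML.parse("""
["0.1.0"]
git-tree-sha1 = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

["0.2.0"]
git-tree-sha1 = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

["0.3.0"]
git-tree-sha1 = "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
""")

for (tree, vs) in shared_snapshots(toy)
    println(tree, " => ", sort(vs))
end
```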

Right, so clearly we haven’t consistently enforced that no two version numbers can refer to identical snapshots of a package. Even if we had consistently enforced that, it seems bad in principle to design a code loading system where this matters. The way we find code has to work for all registries, no matter how versioning in them is implemented or enforced. It also has to work for unregistered snapshots of packages. The one thing these all have is a tree hash. So that’s what we use.

All the non-jll packages I checked come from METADATA, and well, there was no Project.toml at the time.

Most of the jll packages come from introduce fake version for LLVM by vchuravy · Pull Request #26576 · JuliaRegistries/General · GitHub,