I’m still confused about a package’s directory structure. What is the unique “version” identifier included in the path used for? As in “1GXzF” below. We already include a UUID and version in the Project.toml / Manifest.toml, and the “1GXzF” doesn’t seem to be referenced anywhere else. Why don’t we just use the version specified in the TOML files? How does this fit into the process?
It is a hash of the UUID and the content hash (SHA1) of the package source code. It is necessary to support multiple installations of the same package, and multiple packages with the same name.
Imagine you have two projects, one using version A of Zygote and the other using version B. If you only ever have one version on your computer, you’d have to repeatedly redownload the correct version if you switch between the two projects. By having both available, you only have to download a given version once. The 1GXzF serves as the part identifying the different versions on disk, as explained by @fredrikekre.
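To make the on-disk layout concrete, here is a minimal sketch of how a slug and install directory can be derived. It assumes the internal (non-public) function `Base.version_slug`, which is the same one used in the registry check later in this thread. The UUID, tree hashes, package name, and the `install_dir` helper are all hypothetical and chosen purely for illustration.

```julia
using UUIDs

# Hypothetical inputs: these do not refer to any real package snapshot.
uuid   = UUID("01234567-89ab-cdef-0123-456789abcdef")
tree_a = Base.SHA1("0123456789abcdef0123456789abcdef01234567")
tree_b = Base.SHA1("89abcdef0123456789abcdef0123456789abcdef")

# Base.version_slug is internal, so it may change between Julia versions.
slug_a = Base.version_slug(uuid, tree_a)
slug_b = Base.version_slug(uuid, tree_b)

# Hypothetical helper: each snapshot lives under <depot>/packages/<Name>/<slug>.
# The default depot location (~/.julia) is assumed here.
install_dir(name, slug) = joinpath(homedir(), ".julia", "packages", name, slug)

println(install_dir("SomePackage", slug_a))
println(install_dir("SomePackage", slug_b))
```

Two different trees of the same package will almost certainly get different slugs, so both snapshots can sit side by side under the same package name.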
I get that, but why use a randomized identifier? Why not use the version in the Project.toml, which is a lot more informative when you’re trying to find a version?
I would even understand if this was somehow related to the git commit SHA, where you may have multiple checkpoints per version, but this just doesn’t seem to serve any purpose.
Because you can install different “versions” of a package that have the same Project.toml version (because you can install packages by tag or branch or commit sha, not just by version).
Good question. Here are a couple of reasons why using version numbers doesn’t work so well:
You can install and use snapshots of packages that don’t have any version number assigned to them, so what version number would you use for those?
The same snapshot of a package can be assigned multiple different version numbers—e.g. the last release candidate of Julia becomes the release. Should one have to install such versions multiple times with identical trees because they’re referred to by different names?
You might have installed some snapshot of a package before it was given some version number. Should the assignment of a version number after you were using it cause you to have to install it again even though you already have it because you want to put the version numbers in the path?
On the other hand, every package snapshot has a tree hash which, for all practical purposes, is unique (barring SHA1 collisions, which are astronomically unlikely to happen by accident). So, why not use the full unique tree hash instead of a five-character “slug”, you may ask? Because hashes are long:
There exist file systems with path length limits that can cause problems if you stick a 40-character hash in every path—this is a real problem that I hit in practice when originally developing this;
You have to see those slugs in paths, and it’s not uncommon to see source file paths in editors and elsewhere, so having them not be too distractingly long is helpful.
Doesn’t the shortness of the slug destroy the uniqueness? Yes, but it’s “scoped” within a package name, so there would have to be two different package snapshots with the same name having a slug collision. Since there are 62^5 possible slugs, you’d need to install more than 35k different snapshots with the same package name before you had a 50% chance of having a slug collision. If you’re on a case-insensitive file system, the number is somewhat smaller: you’d need to install more than 9k different snapshots in order to have a 50% chance of a collision. That seems unlikely enough to not worry about. Even if we get to the point where there are 9,000 versions of some package, it seems unlikely that anyone would have all of them installed at the same time.
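The 35k and 9k figures are the usual birthday-problem estimate. A minimal sketch that reproduces them, assuming slugs are uniformly distributed over the 62-character (or, on a case-insensitive file system, 36-character) alphabet; the function name is made up for illustration:

```julia
# Smallest number of snapshots n such that n slugs drawn uniformly from N
# possibilities contain a repeat with probability at least p_target.
function snapshots_for_probability(N::Integer, p_target::Real=0.5)
    n = 1
    log_no_collision = 0.0               # log P(no collision) for n snapshots
    while -expm1(log_no_collision) < p_target
        # Adding one more snapshot: it must avoid the n slugs already taken.
        log_no_collision += log1p(-n / N)
        n += 1
    end
    return n
end

n_case_sensitive   = snapshots_for_probability(62^5)  # A-Z, a-z, 0-9
n_case_insensitive = snapshots_for_probability(36^5)  # letters fold together

println("case-sensitive:   ", n_case_sensitive)
println("case-insensitive: ", n_case_insensitive)
```

Both results land slightly above the rounded 35k and 9k thresholds quoted above.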
I went to check how many releases packages have. Right now even the most active packages only have 100-200 releases. In general one can probably expect the pace to decrease once a stable release is hit; for example, pandas and numpy each have fewer than 200 releases. So, yeah, having O(10k) different versions is unlikely, not to mention actually having them all installed at the same time.
OTOH if there are 1000 packages with more than 300 versions, then probably one of them will have a clashing pair of slugs. Or 10,000 packages with more than 100 versions.
It doesn’t affect the main argument but I have to nitpick with some of the reasoning.
This is not true for packages, is it? The version number is written in stone in Project.toml inside the snapshot and won’t change for the same snapshot. Multiple snapshots may have the same version number though.
I think the version in Project.toml isn’t really relevant; the source of truth is always the registry, no? And nothing stops me from registering multiple versions pointing to the same treesha.
It can hardly be the source of truth for branches and snapshots which are not in the registry. A third potential source of truth is each manifest. Which is really the truth probably depends on which part of Julia you ask, but nothing good will come from having these out of sync.
That’s a question of definition. Registrator won’t let you, neither will LocalRegistry, and hopefully neither will any other package that people use for registering. You can of course edit the registry to your liking, and in that sense it’s possible, but if you try pkg"add Package#treehash" Julia will look up the version from Project.toml regardless of what you write in the registry.
Don’t forget that the package name also has to match, as it’s part of the path in question. So in your example those 1,000 (10,000) packages would all have to have the same name.
No, I accounted for that. To put what I said a different way, if a package has 300 versions, there’s roughly a 1/1000 chance it repeats a slug. So if there are 1000 such packages then there is a high chance one of them repeats a slug.
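For anyone who wants to check this arithmetic, here is a minimal sketch under the assumption that slugs are uniform and packages are independent (helper names are made up for illustration). The "roughly 1/1000" figure matches the case-insensitive count of 36^5 slugs; with all 62^5 slugs distinguishable, the per-package chance comes out closer to 1 in 20,000.

```julia
# Probability that n slugs drawn uniformly from N possibilities contain a repeat.
function collision_probability(n::Integer, N::Integer)
    n > N && return 1.0
    log_no_collision = 0.0
    for k in 0:n-1
        log_no_collision += log1p(-k / N)
    end
    return -expm1(log_no_collision)
end

# Probability that at least one of m independent packages has a collision.
at_least_one(p, m) = -expm1(m * log1p(-p))

for (label, N) in (("62^5 (case-sensitive)", 62^5), ("36^5 (case-insensitive)", 36^5))
    p300 = collision_probability(300, N)
    p100 = collision_probability(100, N)
    println(label)
    println("  one package, 300 versions:     ", p300)
    println("  1000 packages x 300 versions:  ", at_least_one(p300, 1000))
    println("  10000 packages x 100 versions: ", at_least_one(p100, 10_000))
end
```

In the case-insensitive setting both scenarios come out a little over 50%, while in the case-sensitive setting they are around 5%.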
The very fact that there’s room for debate about this proves (to me) that version numbers are not good for this purpose. There are version numbers in project files and registries and they normally agree, but they might not, they also might change even though they’re not supposed to. So I think the fact that using them to identify snapshots is a bad idea is borne out.
Here is a function to see if there are any such possible collisions in a given registry:
using TOML, UUIDs

# Report packages in a registry where two distinct tree hashes map to the same slug.
# Expects an unpacked registry directory containing Registry.toml.
function check_duplicate_slugs(registry::AbstractString=joinpath(homedir(), ".julia", "registries", "General"))
    reg = TOML.parsefile(joinpath(registry, "Registry.toml"))
    # package name => slug => versions whose (distinct) trees produce that slug
    d = Dict{String, Dict{String, Vector{VersionNumber}}}()
    for (uuid, pkg) in reg["packages"]
        git_tree_shas_pkg = Set{Base.SHA1}()
        v = TOML.parsefile(joinpath(registry, pkg["path"], "Versions.toml"))
        for (version, version_data) in v
            git_tree_sha1 = Base.SHA1(version_data["git-tree-sha1"])
            # Identical trees share a slug by design, so count each tree only once.
            git_tree_sha1 in git_tree_shas_pkg && continue
            push!(git_tree_shas_pkg, git_tree_sha1)
            slug = Base.version_slug(UUID(uuid), git_tree_sha1)
            d_pkg = get!(valtype(d), d, pkg["name"])
            push!(get!(valtype(d_pkg), d_pkg, slug), VersionNumber(version))
        end
    end
    # More than one version under the same slug means two different trees collided.
    for (pkg, slug_info) in d
        for (slug, versions) in slug_info
            if length(versions) > 1
                @info "Slug collision for pkg $pkg with versions $versions"
            end
        end
    end
    return
end

check_duplicate_slugs()
Currently there are zero (assuming my implementation is correct).
Right, so clearly we haven’t consistently enforced that no two version numbers can refer to identical snapshots of that package. Even if we had consistently enforced that, it seems bad in principle to design a code loading system where this matters. The way we find code has to work for all registries, no matter how versioning in them is implemented or enforced. It also has to work for unregistered snapshots of packages. The one thing these all have is a tree hash. So that’s what we use.