How do tree hashes in registry Versions.toml work?

I’m trying to create a local registry/package mirror and I’m having trouble understanding exactly how the git-tree-sha1 hashes in Versions.toml work. They do not seem to correspond to git commit hashes but instead represent some kind of non-commit tree object. I haven’t been able to find much documentation on how to go from known git commit hashes to trees or vice versa, yet Pkg.jl must be doing that somehow. I know you can use git cat-file on a commit hash to get some kind of tree hash, but checking that out doesn’t seem to give the expected result (i.e. a working directory identical to the one obtained by checking out the commit itself).

Could someone point me to any documentation or specific code that would clarify what is going on here? Thanks.

If I’m not mistaken, this is relevant:

Though I’d just use LocalRegistry.jl instead of rolling something new.

Roughly speaking, a tree hash is a hash of the contents of a git repo, whereas a commit hash is a hash of the contents, commit metadata, and history.

The easiest way to find how commit hashes relate to tree hashes at the top level of a repo is to run

git log --pretty="%H %T"

Not really. Pkg only needs tree hashes.

Thanks, I think that gives me the last piece of the puzzle I needed. I knew how to go from a commit hash to a tree hash but not the reverse (not without scanning every commit in a repo). That log command looks like it gives a reasonably easy way to obtain the reverse mapping.

The biggest difficulty of going from tree hash to commit hash is that it’s not necessarily a unique mapping. Depending on your workflow, you may very well find the same tree hash both for a commit on a feature branch and for the corresponding merge commit to main/master. Also, if you decide to revert a commit, you will get back to the same content and thus the same tree hash.
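
To make the non-unique mapping concrete, here is a minimal Python sketch that inverts the output of the git log command above into a table from tree hash to all commits that share it. The function names (parse_commit_tree_pairs, commits_by_tree_hash) are made up for illustration, and the --all flag is an addition so that commits on every branch are included; treat it as a sketch under those assumptions rather than anything Pkg does.

```python
import subprocess
from collections import defaultdict
from typing import Dict, List


def parse_commit_tree_pairs(log_output: str) -> Dict[str, List[str]]:
    """Invert lines of '<commit hash> <tree hash>' into tree -> [commits]."""
    commits_by_tree = defaultdict(list)
    for line in log_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        commit, tree = parts
        commits_by_tree[tree].append(commit)
    return dict(commits_by_tree)


def commits_by_tree_hash(repo_dir: str) -> Dict[str, List[str]]:
    """Run git log in repo_dir (hypothetical helper) and build the reverse map.

    Assumption: --all is added to the command from the thread so that
    commits reachable from any ref are listed, not only those from HEAD.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--all", "--pretty=%H %T"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_commit_tree_pairs(result.stdout)
```

A tree hash from Versions.toml can then be looked up in the returned dictionary. A list with more than one commit is exactly the feature-branch/merge or revert situation described above, and any of those commits checks out the same content.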

For a historical perspective, Julia’s pre-1.0 package manager used commit hashes and it was a terrible mistake. Using commit hashes forces packages to be downloaded with full git history in order to verify that they are correctly installed and haven’t been modified. It also permanently ties package versions to that git history. But we don’t actually care about history; we only care that the code that is installed and used is correct and what was expected. The development history isn’t relevant when simply using a single package version. The identity of the content alone is precisely what the tree hash captures. It can be and is verified without git, by independently implementing the same algorithm git uses. Moreover, the Pkg system is designed to allow for other tree hash algorithms in the future, such as git-tree-sha256 or even some other approach to content hashing entirely.
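
To illustrate what "independently implementing the same algorithm git uses" can look like, here is a minimal Python sketch of git's SHA-1 tree hashing over a directory on disk. It is not Pkg's or Tar.jl's code. The function names are invented for this example, and it assumes that .git directories are skipped, that directories without any files are omitted (git does not track them), and that the owner-executable bit decides between modes 100644 and 100755.

```python
import hashlib
import os
import stat


def git_object_sha1(kind: str, payload: bytes) -> bytes:
    """SHA-1 of a git object: the header '<kind> <size>' + NUL, then the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).digest()


def _tree_payload(path: str) -> bytes:
    """Serialized tree entries for path; empty bytes if it holds no files."""
    entries = []
    for name in os.listdir(path):
        if name == ".git":  # assumption: git metadata is not package content
            continue
        full = os.path.join(path, name)
        mode_bits = os.lstat(full).st_mode
        if stat.S_ISLNK(mode_bits):
            # A symlink is stored as a blob containing its target path.
            mode = b"120000"
            digest = git_object_sha1("blob", os.fsencode(os.readlink(full)))
        elif stat.S_ISDIR(mode_bits):
            sub = _tree_payload(full)
            if not sub:  # directories with no files are not recorded
                continue
            mode = b"40000"
            digest = git_object_sha1("tree", sub)
        elif stat.S_ISREG(mode_bits):
            mode = b"100755" if mode_bits & stat.S_IXUSR else b"100644"
            with open(full, "rb") as f:
                digest = git_object_sha1("blob", f.read())
        else:
            continue  # sockets, devices, etc. are ignored in this sketch
        raw = os.fsencode(name)
        # git orders entries by name, comparing directories as if they ended in "/".
        sort_key = raw + b"/" if mode == b"40000" else raw
        entries.append((sort_key, mode + b" " + raw + b"\0" + digest))
    entries.sort(key=lambda e: e[0])
    return b"".join(e[1] for e in entries)


def tree_sha1_hex(path: str) -> str:
    """Hex git-tree-sha1 of the directory at path (hypothetical helper)."""
    return git_object_sha1("tree", _tree_payload(path)).hex()
```

One way to sanity-check a sketch like this is to run tree_sha1_hex on a clean checkout and compare the result with the output of git rev-parse HEAD^{tree}. Differences usually come from file-system details such as lost executable bits or symlinks, or from line-ending conversion at checkout.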

Are we talking about the code located below?

Yes. There’s also an implementation in the Tar package: https://github.com/JuliaIO/Tar.jl/blob/master/src/extract.jl#L206-L279.

The Tar version is actually much better because it’s not subject to the brokenness of various file systems, which makes it very challenging to reliably compute the expected tree hash everywhere.
