Unstable sha256 sum for Artifact hosted in another GitHub repo

Hi, guys. I am trying to set up an Artifact and am running into problems with the sha256 checksum: every time it tries to download the file, the checksum is different.

The tarball sits in another GitHub repository and I reference it using the permalink with the commit hash in the URL. I have tried repeatedly downloading it with wget and computing the checksum, and every time the result is different. Shouldn't the sha256 be sensitive only to the contents of the file and not its metadata? Has anyone encountered such a problem before?

If you want to play around with it, here is the file https://github.com/Gregstrq/Isotope-data/blob/9dd2180ba3bc6caf67063a59601af3e0960edbae/isotopes_data.tar.gz. I check the sha256 sum with

using SHA
bytes2hex(open(sha256, "isotopes_data.tar.gz"))

Are you sure you’re actually downloading the raw file? Or are you just downloading GitHub’s HTML page that displays it? That is, you want to use the URL https://github.com/Gregstrq/Isotope-data/raw/9dd2180ba3bc6caf67063a59601af3e0960edbae/isotopes_data.tar.gz (note that it has /raw/ in the path where your link has /blob/).
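
A quick way to tell the two cases apart is to look at the first bytes of whatever was downloaded. The sketch below assumes only that a real `.tar.gz` starts with the gzip magic bytes `0x1f 0x8b` (RFC 1952), while an HTML page does not. The helper names `looks_like_gzip`, `sha256_hex` and `fetch_and_check` are made up for illustration.

```julia
using SHA
using Downloads

# gzip streams always start with the two magic bytes 0x1f 0x8b (RFC 1952).
const GZIP_MAGIC = UInt8[0x1f, 0x8b]

# Return true if the file at `path` starts with the gzip magic bytes.
function looks_like_gzip(path::AbstractString)
    open(path) do io
        header = read(io, 2)   # reads at most 2 bytes
        return header == GZIP_MAGIC
    end
end

# Hex-encoded SHA-256 of the whole file, the same computation as in the question.
sha256_hex(path::AbstractString) = bytes2hex(open(sha256, path))

# Download `url` to a temporary file, refuse anything that is not gzip,
# and return the checksum otherwise. (Hypothetical helper, needs network access.)
function fetch_and_check(url::AbstractString)
    path = Downloads.download(url)
    looks_like_gzip(path) || error("not a gzip file; did you download an HTML page?")
    return sha256_hex(path)
end
```

Calling `fetch_and_check` on the /blob/ URL should hit the error branch, since that URL serves a web page rather than the tarball.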

Depends on what you mean by content and metadata. The hash is computed from all the bytes in the file, regardless of their semantic meaning. It would be possible for GitHub to create a different file every time, differing in the MTIME field in the gzip header, but it’s highly unlikely that they would want to do that. @mbauman’s explanation is almost certainly correct.
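
To illustrate the MTIME point: in the gzip format (RFC 1952), bytes 5 to 8 of the header hold a little-endian modification time. The sketch below builds two byte vectors with identical payloads that differ only in that field and shows that their sha256 hashes differ. `fake_gzip` and `gzip_mtime` are hypothetical helpers, and the result is not a valid compressed stream, only a stand-in for the header layout.

```julia
using SHA

# Build a 10-byte gzip-style header (RFC 1952 layout) with the given MTIME,
# followed by an arbitrary payload. Not a valid compressed stream.
function fake_gzip(mtime::UInt32, payload::Vector{UInt8})
    header = UInt8[0x1f, 0x8b, 0x08, 0x00]                       # ID1, ID2, CM, FLG
    append!(header, UInt8[(mtime >> (8i)) & 0xff for i in 0:3])  # MTIME, little-endian
    append!(header, UInt8[0x00, 0xff])                           # XFL, OS = unknown
    return vcat(header, payload)
end

# Read the MTIME field (bytes 5 to 8, little-endian) back out of the data.
gzip_mtime(data::Vector{UInt8}) = sum(UInt32(data[5 + i]) << (8i) for i in 0:3)

payload = Vector{UInt8}("same contents")
a = fake_gzip(UInt32(0), payload)
b = fake_gzip(UInt32(1), payload)

# The payload bytes are identical, yet the hash of the whole file is not.
same_payload = a[11:end] == b[11:end]
same_hash = sha256(a) == sha256(b)
```

Here `same_payload` is `true` and `same_hash` is `false`, which is the sense in which sha256 sees "metadata" too.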

I do really like the pattern where artifacts are tagged, official GitHub release assets. It takes a bit to set up, but it has some very nice properties (like versioning, traceability, and reproducibility). I’ve been meaning to create a blog post about what I’ve found to be best practice here, but here’s a recent example:

GitHub promises to not mess with the bit-exact value of releases: Update on the future stability of source code archives and hashes - The GitHub Blog

Yes, I had blob instead of raw.

According to the GitHub blog, the use case for releases seems to be related to the git archive function, specifically if you depend on this archiving for security reasons.

In my case, I don’t create the archive automatically, and it has nothing to do with source code. I just want to share a specific DataFrame as an Artifact using a tarball that I generated myself (not automatically).
Using Releases seems like overkill for my specific use case…

That’s just a case where they changed the bit-exactness of a tarball, which is what prompted the post (and the promise). Release assets are a general-purpose tool to deploy pretty much anything up to 2GB. Yeah, that can take work to set up, but it looks like you already have a script to generate this tarball. You can have GitHub Actions run it for you and automatically update your Artifacts.toml. Then it’s a single click to update when your upstream data sources update.

You can of course host artifacts anywhere. This is just one mechanism that GitHub provides that I find valuable.
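
Whichever host you pick, the step of updating Artifacts.toml can be scripted with the Pkg.Artifacts standard library. The following is a minimal sketch, assuming a single data file that gets packed into a tarball; the function name `bind_data_artifact` and all of its arguments are placeholders. Note that the entry records two hashes: `git-tree-sha1`, which depends only on the unpacked contents, and `sha256`, which covers every byte of the tarball.

```julia
using Pkg.Artifacts

# Pack `data_file` into an artifact, archive it to `tarball_path`, and record
# both hashes in `artifacts_toml`. `name` and `url` are placeholders; the
# tarball still has to be uploaded to `url` separately.
function bind_data_artifact(artifacts_toml::String, name::String,
                            data_file::String, tarball_path::String, url::String)
    # git-tree-sha1: hash of the unpacked directory tree.
    tree_hash = create_artifact() do dir
        cp(data_file, joinpath(dir, basename(data_file)))
    end
    # archive_artifact returns the sha256 of the tarball it writes.
    tarball_sha256 = archive_artifact(tree_hash, tarball_path)
    bind_artifact!(artifacts_toml, name, tree_hash;
                   download_info = Tuple[(url, tarball_sha256)],
                   lazy = true, force = true)
    return tree_hash, tarball_sha256
end
```

A CI job could call this after regenerating the data, upload the tarball, and commit the changed Artifacts.toml.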

I have a script that generates the file, which is then compressed into a tarball, although I run the script myself.
Also, the generating script is in one repo, and the Artifacts.toml that uses the file is in another repo. Can I have a GitHub Action that triggers something in a different repository?

IMV one of the biggest advantages here is that you can host the artifacts directly alongside the package itself — all in one repository. It not only feels very official, but I’ve seen that it can make some security reviews more straightforward because they then don’t need to vet a possibly-different host/platform/user/permissions/etc.

In the case of PlotlyKaleido, the GitHub action is Julia code directly in the YAML, stored directly in the package repo itself:

In this case, it’s proxying the upstream .js files without any intervening munging, but you could just as easily download the IUPAC/whatever data and then run the ETL scripts you need. By being part of the same repo, it can then open a pull request to update the Artifacts.toml file. You could also give the releases a human-readable version number like v2025.08.01.
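
As a small sketch of the date-based tag idea: the helper below formats a date as a tag like v2025.08.01 and builds the download URL that GitHub uses for release assets. The function names, and the owner, repo and file values you would pass in, are placeholders.

```julia
using Dates

# Date-based release tag, e.g. Date(2025, 8, 1) becomes "v2025.08.01".
release_tag(d::Date) = "v" * Dates.format(d, "yyyy.mm.dd")

# URL pattern for an asset attached to a GitHub release.
release_asset_url(owner, repo, tag, file) =
    "https://github.com/$owner/$repo/releases/download/$tag/$file"

tag = release_tag(Date(2025, 8, 1))   # "v2025.08.01"
```

The resulting URL is what would go into the download entry of Artifacts.toml alongside the tarball's sha256.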

In fact, I believe that if you’re just fetching tarballs that aren’t releases then GitHub will sometimes regenerate them with different metadata and potentially different compression, so you cannot rely on the sha256 hash of the entire tarball being stable unless it’s a release. Matt’s suggestion seems like the way to go.
