Pkg downtime incident

Earlier today, several users started seeing issues installing packages. This post seeks to collect all the information related to this incident.

Impact

The issue caused incorrect package versions to be installed: the latest master was installed when a prior version was requested.

  • Versions of Julia prior to 1.4 will silently install the wrong version
  • Windows versions of Julia 1.4.x will also silently install the wrong version
  • Non-Windows versions of Julia 1.4.x will issue a warning and fall back to git to obtain the correct version
  • Julia 1.5 is unaffected when using the pkg server (which is the default), otherwise matches 1.4 behavior

The issue has since been mitigated in the registry, so you were only affected if you were attempting package operations on an affected version between approximately 2pm Eastern and 3:43pm Eastern when the mitigation went into effect.

Symptoms

Installing the wrong version of a Julia package can cause incorrect behavior in several different ways. Perhaps the most common will be inscrutable package dependency errors, but more subtle behaviors are possible. If you performed a package operation today, you may want to see the mitigation section below as a precaution.

Mitigation

If an incorrect package version was installed, it will be locally cached until removed. As such, if you believe you were affected, it is advisable to clear your package cache by deleting .julia/packages. Note that your list of installed packages will not be affected and you may re-download all installed packages in your current environment by using Pkg.instantiate().
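The cleanup described above can be sketched as a small shell helper. This is only a sketch: the function name `clear_package_cache` is hypothetical, and it assumes the depot lives at the default `~/.julia` unless another depot directory is passed as the first argument.

```shell
# Hypothetical helper: remove the cached package sources from a Julia depot.
# Assumption: the depot is ~/.julia unless a different directory is passed.
clear_package_cache() {
  depot="${1:-$HOME/.julia}"
  # Only the packages/ subdirectory is removed. Other depot contents, such as
  # environments/ (the record of what is installed), are left untouched.
  rm -rf "$depot/packages"
}

# Typical use (not run here):
#   clear_package_cache
#   julia -e 'using Pkg; Pkg.instantiate()'
```

`Pkg.instantiate()` only re-downloads packages for the environment that is active when it runs, so it would need to be repeated for each project environment you use.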

Root cause

The root cause of this incident was an unannounced server-side change by GitHub, which broke downloading tarballs by git-tree-hash: previously, https://api.github.com/repos/JuliaLang/MbedTLS.jl/tarball/2d94286a9c2f52c63a16146bb86fd6cdfbf677c6 would give the tarball for that tree-hash, while it now gives the tarball for master instead. We do not yet know whether this change was intentional or not. The reason this change broke Pkg is that Pkg includes a heuristic where it uses the tarball download feature instead of a full git checkout as a faster way to download a requested version (since it then does not need to download the full repository with all its history). This heuristic is special-cased for github.com and does not affect packages hosted elsewhere (though the vast majority of packages are currently hosted on GitHub).

Registry workaround

The workaround mentioned above was https://github.com/JuliaRegistries/General/pull/18991/files, which changes the host in the URL of every registered package from github.com to GitHub.com. This defeats the above-mentioned heuristic and forces older versions of Julia to fall back to a full git checkout instead. This method is slower but should yield the correct package version. Note that Julia 1.5+ is unaffected and downloads via the Pkg server will continue to be fast.
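To make the mechanism concrete, here is a rough shell sketch of such a heuristic and of why the capitalized host name sidesteps it. It is not Pkg's actual code (Pkg is written in Julia). The function name `tarball_url` and the exact matching rule are assumptions for illustration; only the shape of the `api.github.com` tarball URL is taken from this post.

```shell
# Hypothetical sketch of a "use a tarball for GitHub repos" heuristic.
# Prints a tarball URL and returns 0 if the repo URL looks like a GitHub
# repo; returns 1 otherwise, meaning "fall back to a full git clone".
tarball_url() {
  repo_url="$1"
  tree_hash="$2"
  case "$repo_url" in
    # Assumed rule: a case-sensitive match on the lowercase host name.
    https://github.com/*)
      slug="${repo_url#https://github.com/}"
      slug="${slug%.git}"
      printf 'https://api.github.com/repos/%s/tarball/%s\n' "$slug" "$tree_hash"
      ;;
    *)
      return 1
      ;;
  esac
}

tree=2d94286a9c2f52c63a16146bb86fd6cdfbf677c6
# Lowercase host: the heuristic applies and a tarball URL is printed.
tarball_url https://github.com/JuliaLang/MbedTLS.jl.git "$tree"
# Capitalized host: no match, so the caller would clone with git instead.
tarball_url https://GitHub.com/JuliaLang/MbedTLS.jl.git "$tree" || echo "no match: full git clone"
```

Host names are case-insensitive, so git itself still reaches the same server through the GitHub.com spelling; only the string match in the client stops applying.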

Additional considerations/General registry updates paused

We have contacted GitHub to find out whether this change was intentional and is likely to persist. If so, we will need to update Registrator and the validation CI to force packages registered at GitHub to use the same GitHub.com workaround we manually applied to the registry. If not, the workaround will be reverted as soon as GitHub restores the original behavior (to get back to faster package download speeds on older versions). In the meantime, changes (new packages/version bumps) to the General registry are paused. They will be resumed once either of the two options has been completed.

Future considerations

As noted, Julia versions 1.5+ are not affected due to the Pkg server work (which was partly motivated by a desire to avoid incidents like this one). However, such Julia versions will still fall back to raw GitHub downloads if the package server is unavailable for some reason (broken, blocked by corporate firewall, we forgot to pay our bills, etc.). In the near future, the validation currently present on non-Windows versions will be extended to Windows versions, such that even with a broken package server, the fallback path would itself fall back to Git if it is being served incorrect tarballs (the same verification will of course extend to the package server also). This change has been planned for some time and the requisite support is already available in Tar.jl, but has not yet been wired up in Pkg.
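The idea behind this kind of validation can be illustrated outside of Julia. The sketch below is not how Pkg or Tar.jl implement it. It leans on git itself to compute the tree hash of an unpacked tarball and compares that with the hash that was requested. The function names are hypothetical, and it assumes `git` and `tar` are installed and that the tarball wraps its files in a single top-level directory, as GitHub's tarballs do.

```shell
# Hypothetical sketch: compute the git tree hash of a tarball's contents.
# The body runs in a subshell, so the cd and the variables do not leak.
tarball_tree_hash() (
  tarball="$1"
  workdir="$(mktemp -d)" || exit 1
  # Unpack without the top-level directory, stage everything in a scratch
  # repository, and ask git for the hash of the resulting tree.
  # --force keeps ignore rules from silently hiding files.
  tar -xzf "$tarball" -C "$workdir" --strip-components=1 &&
    cd "$workdir" &&
    git init --quiet . &&
    git add --all --force . &&
    git write-tree
  status=$?
  cd / && rm -rf "$workdir"
  exit "$status"
)

# Succeeds only if the tarball's contents hash to the expected tree hash.
verify_tarball() {
  actual="$(tarball_tree_hash "$1")" || return 1
  [ "$actual" = "$2" ]
}

# Typical use (not run here):
#   verify_tarball pkg.tar.gz "$expected_tree_hash" || echo "mismatch: clone with git instead"
```

With a check like this, a tarball of master served in place of the requested tree would be rejected rather than silently installed. Edge cases such as symlinks, file modes, or files excluded from archives may make the hashes differ for reasons unrelated to an incident like this one.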

This is “clever”. :slight_smile:

Fight failure with failure, as they say.

Is the following related to this?

┌ Warning: Due to a previously reported error, the running code does not match saved version for the following files:
│ 
│   /home/usernamehere/.julia/packages/Flux/IjMZL/src/layers/basic.jl
│ 
└ Use Revise.errors() to report errors again.

Note, I’m not running Revise at all (it must be a dependency somewhere)…

Possibly, but unlikely.

Is it possible to switch to a shallow clone? I imagine this would be a bit slower than a tarball, but substantially better than cloning a full history.

GitHub requests people don’t do this because the full clone path is special cased while the shallow clone case uses a slow path that causes lots of CPU load.

“2pm Eastern to 3.43 Eastern”. I appreciate this email was probably rushed out ASAP, but does that mean Eastern Standard Time (UT-5) or Eastern Daylight Time (UT-4)?

Then again, being on UT+9.5 myself, I was fast asleep either way :slight_smile:

The problem is that we cannot change the Julia client in any way because people already have it and we can’t change code on their computers. Otherwise we could just remove this behavior or allow it to be overridden by the registry.

It means the time that people on the east coast of the US see on their clocks :wink: - in this case Eastern Daylight Time.

I am on Julia 1.5. How do I know I am using the new Pkg protocol? Is there a warning printed if that fails and it falls back to the old GitHub downloads?

Has GitHub been made aware of this? Since it’s a publicly accessible API, surely this must have been unintentional?

The original URL is being redirected to

https://codeload.github.com/JuliaLang/MbedTLS.jl/legacy.tar.gz/2d94286a9c2f52c63a16146bb86fd6cdfbf677c6

Seems like they either want to deprecate tarball downloads or do something else with them since they’re calling them legacy.tar.gz?

Funnily enough, this sudden change in behaviour goes directly against their recently published roadmap on how they want to roll out new features…

Yes, GitHub is aware. Our present understanding is that they’ll be rolling back this change, but we’re awaiting explicit confirmation to that effect.

Actually, I think that means it is fixed now on GitHub’s side. Previously it resolved to https://codeload.github.com/JuliaLang/MbedTLS.jl/legacy.tar.gz/master but now it has the tree hash in the URL and the downloaded file seems to contain the correct files.
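A check along these lines can be scripted with curl. This is only a sketch: the helper names `final_url` and `last_segment` are hypothetical, and `final_url` needs network access and downloads the tarball once in order to follow the redirect.

```shell
# Hypothetical helper: follow redirects, discard the body, and print the
# URL that the request finally resolved to.
final_url() {
  curl -sS -L -o /dev/null -w '%{url_effective}\n' "$1"
}

# Hypothetical helper: print the last path component of a URL, which for a
# codeload URL is the ref that will actually be served.
last_segment() {
  printf '%s\n' "${1##*/}"
}

# Typical use (not run here, needs network):
#   want=2d94286a9c2f52c63a16146bb86fd6cdfbf677c6
#   got="$(last_segment "$(final_url "https://api.github.com/repos/JuliaLang/MbedTLS.jl/tarball/$want")")"
#   [ "$got" = "$want" ] && echo "redirect keeps the tree hash" || echo "redirect points at $got"
```

A result of `master` instead of the tree hash would correspond to the broken behavior described earlier in the thread.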

The workarounds have been reverted and everything should be back to normal now.

As long as the git server supports it (GitHub seems to), you can shallow clone a commit through

mkdir <repo> && cd <repo>
git init
git remote add origin <url>
git fetch --depth 1 origin <ref/sha>
git checkout FETCH_HEAD

Might be more viable than hand-coding tarball URLs for different git servers.

Thanks for your quick response and hard work to mitigate this.

Update on this:

  • GitHub has rolled back the change, so the API for downloading tarballs by tree hash is no longer serving incorrect tarballs.
  • The registry workaround hack has been reverted so people using older Julia versions will get (correct) tarball downloads again.
  • Auto-merge on the General registry has been turned back on.

In short: everything is back to normal now.

GitHub was very responsive about this and we’re very grateful to them for that. These kinds of things happen when you run a service and the best that one can do is to react quickly when something goes wrong.

The major lesson here is that the Pkg client should not rely on baked in knowledge of 3rd party APIs in a way that cannot be overridden remotely by making metadata changes to the registry. Of course, had GitHub simply discontinued this API and returned a 404, then everything would have been fine, since Pkg would try it, fail and fall back to cloning the package repo. What caused the problem was that the API seemed to continue working while returning the wrong content, which is a failure mode that’s pretty hard to anticipate. Fortunately, with the Pkg protocol in 1.5, we no longer rely on 3rd party APIs like this.

Pkg.update shows where it downloads the registry. In my case, the switchover to the new protocol was not automatic. See
