GitHub contingency planning — should Julia be backing up issues/PRs?

Reading Cory Doctorow’s article on the inevitable en-:poop:-ification of for-profit websites made me think about how closely the development of Julia (and many other free/open-source projects) is currently tied to github.com. What is the contingency plan if Microsoft (gasp) turns evil?

Yes, people can move their git repos elsewhere, but that wouldn’t preserve the issues and PR discussions, which are an invaluable record of development history. And if Github were, hypothetically, to suddenly change its APIs to lock in developers and make migration harder, it might be too late to easily grab everything.

Even for a project the size of JuliaLang/julia, the entire archive of issues and PRs couldn’t possibly be more than a GB or so. Wouldn’t it be prudent to institute regular backups? What’s the best tool for this, python-github-backup?
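
For a sense of scale, even a hand-rolled script can grab the bulk of it. Here is a rough sketch (not python-github-backup itself; the repo name, token handling, and output file are just placeholders) that pages through GitHub’s REST API and dumps every issue and PR, minus comments and attachments, to a JSON file:

```python
# Sketch only: back up all issues/PRs of a repo to JSON via the GitHub REST API.
# Assumes the `requests` package and a personal access token in GITHUB_TOKEN.
import json
import os

import requests

REPO = "JuliaLang/julia"             # repository to back up (placeholder)
TOKEN = os.environ["GITHUB_TOKEN"]   # assumed: a personal access token

def fetch_all_issues(repo):
    """Page through /repos/{repo}/issues; PRs are returned as issues too."""
    issues, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/issues",
            headers={"Authorization": f"token {TOKEN}"},
            params={"state": "all", "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return issues
        issues.extend(batch)
        page += 1

with open("issues-backup.json", "w") as f:
    json.dump(fetch_all_issues(REPO), f)
```

A real backup would also need to walk the comment, review, and event endpoints, which is presumably the bookkeeping a dedicated tool like python-github-backup takes care of.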

39 Likes

:100::100: This concept has really changed my perspective since I read that. It’s nothing I didn’t know, but having the framework to slot all of the different examples into really makes it concrete.

At least at the moment, these things are preserved if migrating to GitLab. Not sure about Gitea and Codeberg and the like, but it seems plausible that they would. It’s also possible to export basically everything to JSON. Could be a good idea to make this part of the release process or something (perhaps just minor releases and not every patch)?

4 Likes

Tools like Gitea have forge-specific importers which (usually with an API access token) allow importing of issues, PRs, and more.

I’ve recently used this to turn a bunch of my GitHub repos into GitHub mirrors of the canonical self-hosted repo. You can see an example of the final result here: https://git.tecosaur.net/tec/emacs-everywhere.

That said, it appears this may be harder with huge repos. For instance, see Gitea hosted Gitea · Issue #1029 · go-gitea/gitea · GitHub where the last comment contains:

The alternative export API that Github provides is sadly not working for us, it always returns that it fails to export the data. My guess is that the amount of data we have is too much.


It’s also worth noting there’s a soon-ish-to-be-merged PR for syncing issues/PRs/etc. from a pull mirror (Sync issues/PRs etc from pull mirror by harryzcy · Pull Request #20311 · go-gitea/gitea · GitHub), which should help lead to bidirectional syncing in the future.

2 Likes

I know it’s possible to export everything now. But it seems unwise to rely on that being possible indefinitely. A process for regular complete backups seems like something it would be good to make routine well before there seems to be any concrete reason to worry.

For the purposes here, I don’t think relying on yet another commercial service (gitlab, gitea, …) is a good idea. You want a regular backup, using free/open-source tools, to open formats like JSON etc.

5 Likes

I don’t think relying on yet another commercial service (gitlab, gitea, …)

Indeed, that’s why I suggested Gitea: it’s not a commercial service but a self-hostable FOSS forge. Once the ForgeFed effort comes to fruition, you’ll be able to get the repos in F3 (Friendly Forge Format), a relatively new open standard for forge data (see https://forum.forgefriends.org/t/about-the-friendly-forge-format-f3/681 / https://f3.forgefriends.org). The sync PR I linked to is effectively “a regular backup, using free/open-source tools, to open formats like JSON etc.”.

Is there a bit of “soon™” going on here? Yes, but it’s the closest thing to a good solution that I’m currently aware of, and the ForgeFed/F3 effort is moving along.

3 Likes

Thanks for the clarification — I was confused by the .com. They offer commercial support for a FOSS platform, which is fine.

But I don’t think we have to move immediately to self-hosting. Just have a regular backup of data so that we could switch to self-hosting at any time without a huge loss of information.

2 Likes

But I don’t think we have to move immediately to self-hosting. Just have a regular backup of data so that we could switch to self-hosting at any time without a huge loss of information.

Yea, it would be a bit premature in some respects. To be safe in the knowledge that you actually have the backup in a format that could be used for self-hosting in the future, though, I think you’d want one of:

  • A live pull sync of everything you want backed up: in a way it’s the ultimate way of being sure you can self-host at any time, since you’d already be quietly doing so
  • An F3 archive, but the standard is still settling down so that’s probably a bit hard ATM (IETF draft is in the works: Submission status of draft-f3-format-00)
  • Some other format that can later be massaged into F3 or otherwise imported into some self-hosted forge

Thanks for the clarification — I was confused by the .com. They offer commercial support for a FOSS platform, which is fine.

NP. Last year a company for commercial support was formed, and there was a bit of community uproar because all of the project branding IP (name, logo, etc.) was transferred to the commercial entity instead of a non-profit affiliated with it. As a result there’s now an active fork called Forgejo, which the largest Gitea instance (codeberg.org) has switched to, and the ForgeFed team is, as I understand it, working on that fork ATM.

python-github-backup (linked above) has a GitHub API throttling feature that may help with this.
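
For illustration, the throttling itself is simple enough to hand-roll too. A rough sketch (my own helper, not python-github-backup’s implementation) that watches GitHub’s documented rate-limit headers and sleeps until the quota resets:

```python
# Sketch of rate-limit-aware fetching against the GitHub API.
import time

import requests

def get_with_throttle(url, headers=None, params=None):
    """GET a URL, then pause if GitHub reports the rate-limit quota is exhausted."""
    resp = requests.get(url, headers=headers, params=params)
    remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
    if remaining == 0:
        reset_at = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
        time.sleep(max(0, reset_at - time.time()) + 1)  # wait out the window
    return resp
```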

In practice, I think the backup tool’s robustness and completeness are more important than the format. My feeling is that any reasonably structured open format can be massaged into any other open format if needed.

3 Likes

Just popping by to say that it would sure be nice to back up the packages from the General registry in addition to Julia itself, but idk how feasible/legal it is license-wise?

2 Likes

Most of them probably already are, in terms of Git history at least; see https://www.softwareheritage.org/.

I quickly checked a few packages I registered this year or last, and they’re not on there, so presumably the backup is partial.

1 Like

I share your concerns, but I don’t think that a backup is the best solution in this case.

Backups without verified restore and integrity compared to the original can easily turn out to be useless. For a filesystem backup, this is easy to do, but for Github metadata, this is more or less equivalent to doing the full migration to a self-hosted solution.

So, I would propose that effort is spent on finding a self-hosted solution instead, and migrating while we can. For a typical project, CI would be a pain point (Github provides it for free), but my understanding is that Julia already moved that out of Github.

1 Like

Have you heard the phrase “the perfect is the enemy of the good”?

Moving to a self-hosted solution requires a lot more resources than simply running a few-line python-github-backup script (or similar) occasionally and occupying a bit of disk space. Even if the backup is not perfect — even if it loses 5% of the issue/PR information — it would still be a huge improvement over having nothing. (The only backup that needs to be perfect is the code, but git already gives us that.) And insisting on a self-hosted solution or nothing means that most projects will have nothing.
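
And a minimal sanity check is cheap to add on top: compare the number of items in the backup against what the live API reports. A rough sketch, assuming a JSON dump like the one sketched earlier in the thread (the repo name and filename are placeholders):

```python
# Sketch: sanity-check a JSON issue/PR backup against GitHub's live count.
import json

import requests

REPO = "JuliaLang/julia"  # placeholder repository

with open("issues-backup.json") as f:
    backup = json.load(f)

# The search API reports the total number of issues + PRs for a repo.
resp = requests.get(
    "https://api.github.com/search/issues",
    params={"q": f"repo:{REPO}"},
)
resp.raise_for_status()
live_count = resp.json()["total_count"]

print(f"backup has {len(backup)} items, GitHub reports {live_count}")
if len(backup) < live_count:
    print("backup looks incomplete; re-run it before relying on it")
```

That is far from a verified restore, but it catches the silent failure mode where the backup job has been producing garbage for months.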

6 Likes

While you make a valid point, the concern is not that the backed-up data might be imperfect, but that it might be garbage. That can easily happen down the line with an automated solution.

Nevertheless, you are right, even a backup from a few months ago or similar would be valuable.

FWIW I don’t think a self-hosted Gitea mirror will take nearly as many resources as you seem to think. I’d think a €5/month Hetzner VM would be sufficient, and low-maintenance.

1 Like

And then we trust the ransomware specialists won’t encrypt everything! I have seen how important backups are.

This is already done, I believe. A JuliaHub package server contains a copy of all packages in the Julia General registry. Search this forum for a response to the “leftpad” incident.

1 Like

My completely naive and probably stupid thought: could such a backup be “hosted” via a distributed (and possibly partitioned) torrent? Community members provide the storage gratis, multiple copies of the backups help ensure redundancy, etc.

1 Like

I second Gitea. I have been running it for a few months on Hetzner and it has been very reliable. A noteworthy point is that its source is also hosted on GitHub.

Quite ironic in some way, but I believe herein also lies the problem with self-hosting: the barrier to contribution goes up and general visibility goes down. Maybe a centralized platform funded by donations, such as Wikipedia or the Signal messenger, could be an option. Not sure if that exists already or how the financing would work out. I guess I should look at what people have posted earlier (Codeberg exists).
Quite ironic in some way, but I believe here lies also the problem with self-hosting. The barrier to contribution is increased and general visibility decreased. Maybe a centralized platform based on donations, such as Wikipedia or the Signal messenger could be an option. Not sure if that exists already and how the financing works out. I guess I should look at what people have posted earlier (codeberg exists).