Dropping GitHub history from a published package

Someone kindly pointed out to me that my package StatisticalRethinking.jl is severely bloated. It contains more than 1 GB of garbage in the .git/objects/pack directory: multiple copies of a few fairly old IJulia notebook (.ipynb) files I used in v1. Many other v1 notebooks are no longer present.
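For readers who want to check which files are responsible for this kind of bloat, here is a minimal sketch that lists the largest blobs reachable from any ref, including files deleted long ago. The helper name `largest_blobs` and the throwaway demo repository (with a hypothetical `big.ipynb`) are assumptions for illustration, not something from the original post.

```shell
# Identity for the demo commits only (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Print the N largest blobs in a repository's history as "<bytes> <path>".
largest_blobs() {
    repo=$1
    count=${2:-10}
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |
        sort -rn |
        head -n "$count"
}

# Demo: a notebook that was committed once and then deleted still sits in history.
demo=$(mktemp -d)
git init -q "$demo"
head -c 200000 /dev/zero > "$demo/big.ipynb"
printf 'small\n' > "$demo/README.md"
git -C "$demo" add .
git -C "$demo" commit -q -m "add notebook"
git -C "$demo" rm -q big.ipynb
git -C "$demo" commit -q -m "delete notebook"

largest_blobs "$demo" 3            # the deleted notebook is listed first
git -C "$demo" count-objects -vH   # overall size of loose and packed objects
```

In a real repository, `largest_blobs . 20` run from the top-level directory would give the same kind of listing.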

I have tried to remove those files and rewrite the GitHub history (following many of the recipes available online), to no avail. The last resort would be to delete the repo, create a new one, and make the current content of the repo the initial commit (with a fresh .git subdirectory).
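A related option that keeps the existing repository is an orphan branch, which starts a new history from the current content. The sketch below uses a throwaway repository, and the branch name `fresh-start` and file names are hypothetical. It only shows the mechanics; it does not address what happens to registered releases.

```shell
# Identity for the demo commits only (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Build a small repository with two commits of history.
repo=$(mktemp -d)
git init -q "$repo"
echo one > "$repo/file.txt"
git -C "$repo" add .
git -C "$repo" commit -q -m "first"
echo two > "$repo/file.txt"
git -C "$repo" commit -q -am "second"

# Start a branch with no parent commits; the index and working tree are kept,
# so the current content becomes the initial commit of the new history.
git -C "$repo" checkout -q --orphan fresh-start
git -C "$repo" commit -q -m "Initial commit (history dropped)"

# Number of commits reachable from the new branch.
git -C "$repo" rev-list --count HEAD
```

The old branch still exists after this, so the old objects remain in the repository until that branch is deleted and the repository is garbage-collected.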

Before doing that I wanted to check if there are any other suggestions. Losing the history is not a major concern “content wise”, but I’m not sure whether the JuliaHub tools would be able to continue with the next version (v4.4.2). I don’t think there are other published packages that depend on StatisticalRethinking.jl except a few Julia projects in the StatisticalRethinking GitHub org.

Refer to Git - Maintenance and Data Recovery (section Removing Objects). That might be of some help.

Note that if you rewrite the history by changing the content of release revisions, which would affect their git-tree-sha1, you’d be making your own package uninstallable (if it’s registered).

Thanks Petr and Mosè!

Mosè:

Have you seen an example of someone merging a history-free new version of a package? Or is there simply no remedy for this?

Petr:

I attempted the Git book’s approach, but after rewriting the Git history the file was corrupted. That is when I went online and tried Stack Overflow suggestions (which are all variations on what the book suggests).

I think it is the rewrite step, which does complete after several minutes, that corrupts the file (it already warns about bugginess):

rob@Rob-16-MBP-2 StatisticalRethinking % git filter-branch -f --index-filter \
  'git rm --ignore-unmatch --cached notebooks/03/clip-02-05.ipynb' -- e0ec1390^..
WARNING: git-filter-branch has a glut of gotchas generating mangled history
	 rewrites.  Hit Ctrl-C before proceeding to abort, then use an
	 alternative filtering tool such as 'git filter-repo'
	 (https://github.com/newren/git-filter-repo/) instead.  See the
	 filter-branch manual page for more details; to squelch this warning,
	 set FILTER_BRANCH_SQUELCH_WARNING=1.
Proceeding with filter-branch...
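One step that is easy to miss with this recipe (offered as a general note, not as a diagnosis of the corruption described above): after `git filter-branch`, the old objects stay reachable through the `refs/original/` backups and the reflog, so the pack does not shrink until those are dropped and the repository is garbage-collected. The sketch below reproduces this in a throwaway repository; the path `notebooks/big.ipynb` is a hypothetical stand-in for the real notebook.

```shell
# Identity for the demo commits only (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning and its delay

# Demo repository with a large notebook that is later deleted.
repo=$(mktemp -d)
git init -q "$repo"
mkdir -p "$repo/notebooks"
head -c 100000 /dev/urandom > "$repo/notebooks/big.ipynb"
echo readme > "$repo/README.md"
git -C "$repo" add .
git -C "$repo" commit -q -m "add notebook"
blob=$(git -C "$repo" rev-parse HEAD:notebooks/big.ipynb)
git -C "$repo" rm -q notebooks/big.ipynb
git -C "$repo" commit -q -m "remove notebook"

# Rewrite every commit so the notebook never appears in any tree.
git -C "$repo" filter-branch -f --index-filter \
    'git rm --ignore-unmatch --cached notebooks/big.ipynb' -- --all > /dev/null

# The blob is still stored: refs/original/ and the reflog keep it alive.
if git -C "$repo" cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi

# Drop the backup refs and reflog entries, then repack and prune.
git -C "$repo" for-each-ref --format='%(refname)' refs/original |
    while read -r ref; do git -C "$repo" update-ref -d "$ref"; done
git -C "$repo" reflog expire --expire=now --all
git -C "$repo" gc -q --prune=now

if git -C "$repo" cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi
echo "before cleanup: $before, after cleanup: $after"
```

The cleanup is destructive: once the backup refs and reflog are gone there is no way back to the old history from this clone, so it is safer to try it on a fresh copy first.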

@giordano @dilumaluthge

Maybe the simplest solution is to move the contents of StatisticalRethinking.jl to a new package, StatisticalRethinkingBase.jl (since v2, StatisticalRethinking.jl has only been intended to support a number of Julia projects).

Over time StatisticalRethinking.jl will become a general intro for the other packages in the StatisticalRethinking GitHub organization.

I remember that DifferentialEquations.jl went through something similar years ago due to large PDFs in documentation, or something like that. Maybe @ChrisRackauckas could advise on the best approach.

The best approach is to just never make a repo have that problem :sweat_smile:. However, note that these days users do not download the repo; they only get the release version, so you don’t really need to worry about the repo’s history for package usage.

If you want to fix the history for other reasons, well, you could do a BFG repo clean and open a PR to General fixing all of the SHAs.

We would not accept that PR. We do not accept PRs that change the tree hashes.

Probably the easiest thing for you to do here is to make a new package named StatisticalRethinking2.jl that contains only the files you want.

Then, add a deprecation notice to StatisticalRethinking.jl and archive the repository.

It’s not uncommon in the Julia ecosystem for packages to be deprecated and archived in favor of a new package.

This doesn’t really make sense: you cannot (by definition) change the content of a release, because the release is addressed by the content itself. As long as the content exists somewhere in the repo you are fine. Keeping the content available is as easy as, e.g., having a tag for each released version. At that point you can do whatever you want with the history of the master branch.
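The point about content addressing can be checked in a throwaway repository. In this sketch (the file `Demo.jl`, the tag `v1.0.0`, and the branch name `rewritten` are hypothetical), a release is tagged, the main branch is then replaced by a single history-free commit, and the tree hash behind the tag is unchanged.

```shell
# Identity for the demo commits only (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"

# A "release": commit the content, tag it, and record its tree hash.
echo 'module Demo end' > "$repo/Demo.jl"
git -C "$repo" add .
git -C "$repo" commit -q -m "release 1.0.0"
git -C "$repo" tag v1.0.0
tree_before=$(git -C "$repo" rev-parse 'v1.0.0^{tree}')

# More work on the main branch after the release.
echo 'module Demo; f() = 1; end' > "$repo/Demo.jl"
git -C "$repo" commit -q -am "work after release"
main=$(git -C "$repo" symbolic-ref --short HEAD)

# Replace the main branch with one parentless commit of the current content.
git -C "$repo" checkout -q --orphan rewritten
git -C "$repo" commit -q -m "squashed history"
git -C "$repo" branch -M "$main"

# The tag still resolves to the same tree, independent of the branch history.
tree_after=$(git -C "$repo" rev-parse 'v1.0.0^{tree}')
echo "tree before: $tree_before"
echo "tree after:  $tree_after"
```

The tree hash depends only on the files in the release, not on the commits that led to it, which is why a tag per released version is enough to keep that content available.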

Wouldn’t your repo lose all its stars though?

(asking for a friend… :slight_smile: )

Thanks everybody for the very helpful comments. I’ll do what I indicated above (and similar to what @dilumaluthge suggests).

StatisticalRethinking.jl will remain around as an anchoring point and overall README. This will preserve the stars (@cormullion).

As I did with Stan.jl for the StanJulia GitHub organization, it will have no additional functions (these will all go to StatisticalRethinkingBase.jl), just GitHub-organization-type docs. StatisticalRethinking.jl will continue to be used for overall testing (e.g. functionality comparisons between Stan and Turing, or showing new options such as the recently released ParetoSmooth.jl, AxisKeys.jl and DimensionalData.jl packages).

As @ChrisRackauckas pointed out (and I didn’t know), the size issue is only a problem if the package is dev-ed.

Thanks again all!

That’s probably a good policy :sweat_smile: .

Is there something in dev equivalent to Git’s --depth option, to avoid downloading the full history?

I believe if you do

git clone --depth 10 https://github.com/StatisticalRethinkingJulia/StatisticalRethinking.jl

Julia does not recognize the package. I think that matches what Kristoffer said above?
