Slow installation of mono-repo

I am maintaining a privately registered git repo that contains multiple packages. The repo is becoming rather big, ~300 MB (I know I should do something about that). Whenever a user installs a package, the whole repo needs to be cloned, and even if the repo has already been cloned, the package manager still makes a new clone. As a result, installing multiple packages takes a long time, since N × 300 MB has to be downloaded.
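For concreteness, here is a minimal sketch of how one of these sub-packages is added directly by URL (the repo URL and subdirectory names are placeholders); installing by name through our private registry ends up cloning the same repository:

```julia
using Pkg

# Each package lives in its own subdirectory of the same ~300 MB repo,
# so every add triggers a fresh clone of the whole thing
# (URL and names below are placeholders):
Pkg.add(url="https://git.example.com/ourorg/MonoRepo.jl", subdir="PackageA")
Pkg.add(url="https://git.example.com/ourorg/MonoRepo.jl", subdir="PackageB")
```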

I am looking for some advice on how to manage this. A few things I came up with myself:

  • Reduce git repo size (difficult/dangerous, not a permanent solution)
  • Move each package to its own git repo (this may not be desirable for us)
  • Find a way to reuse the cloned git repo to install multiple packages?

Is there something else I can do to make installation of our packages less heavy? I notice that installation of packages from the General registry is often very fast, so I assume some smarter tricks are used there?

3 Likes

The main thing that makes installation of packages from General different is that packages are normally distributed by package servers instead of by cloning git repositories. There are ways to distribute your own packages through a package server too, e.g. LocalPackageServer.jl (https://github.com/GunnarFarneback/LocalPackageServer.jl), a Julia storage and package server for local packages.
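As a rough sketch, once such a server is running, clients only need to point Pkg at it via the `JULIA_PKG_SERVER` environment variable (the URL and package name below are placeholders; the variable is normally set in the shell or startup.jl before any Pkg operations):

```julia
# Point Pkg at the local package server (placeholder URL); normally this is
# set as an environment variable before Julia starts.
ENV["JULIA_PKG_SERVER"] = "https://pkgserver.example.internal"

using Pkg
Pkg.Registry.update()          # registry updates now go through the server
Pkg.add("MyPrivatePackage")    # delivered as a tarball instead of a git clone
```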

That said, there are probably optimization opportunities for this use case both for the package manager itself and for LocalPackageServer.

1 Like

Are you storing data in the git repo, or why is it so large? If so, you should definitely move the data out, as git is best suited for code only. For comparison: the pytorch repo is 1.2 GB with its complete history. If you want faster clones you can restrict the depth with `git clone --depth 1`, which only clones the latest version. For pytorch, for example, this reduces the size to 331 MB.

2 Likes

Unfortunately, we do indeed have some test data in our repository; I think that has caused the bloat over time. The `--depth 1` trick works when cloning manually; however, when the package manager takes over, I think it does a full clone and then checks out the git hash associated with the version of the package being installed.

One full clone is maybe not even so bad, but when multiple packages from the repo need to be installed, the full repo is cloned again every time. Maybe the package manager could recognize that it already has a full clone in the depot's clones folder?
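For reference, the clones I mean are the bare git clones Pkg keeps under the depot; a quick way to see what is already cached (assuming the default depot layout under ~/.julia):

```julia
# List the git clones Pkg has already made in the depot's clones folder
# (assuming the default depot layout):
clones_dir = joinpath(first(DEPOT_PATH), "clones")
foreach(println, readdir(clones_dir))
```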

Support for having a package in a subdirectory of a repository was added much later than the mechanisms used for the clones folder, so it’s likely that there are things that can be improved. But with package servers now being the main way packages are distributed, this is also likely to be fairly low priority for the Pkg developers.

Why not use lazy artifacts instead?
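In case it helps, here is a minimal sketch of turning the test data into a lazy artifact with Pkg.Artifacts (the paths, the artifact name `testdata`, and the download URL are all placeholders); the data is then only downloaded when something actually asks for it:

```julia
using Pkg.Artifacts

# One-off maintainer script: package the test data as an artifact and record
# it in Artifacts.toml as *lazy*, so users only fetch it on demand.
artifacts_toml = joinpath(@__DIR__, "Artifacts.toml")

# Copy the data into a content-addressed artifact tree.
data_hash = create_artifact() do dir
    cp("test/data", joinpath(dir, "data"))   # placeholder path to the test data
end

# Tar it up so it can be hosted outside the git repo.
tarball_sha256 = archive_artifact(data_hash, "testdata.tar.gz")

# Bind it in Artifacts.toml; clients download it lazily from the given URL.
bind_artifact!(artifacts_toml, "testdata", data_hash;
               download_info = [("https://example.com/artifacts/testdata.tar.gz", tarball_sha256)],
               lazy = true)
```

In the package's tests, `using LazyArtifacts` plus `artifact"testdata"` then triggers the download on first use only, and the git repository itself stays small.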