I’ve been pondering an idea to optimize the storage footprint of Julia packages and would love to get your insights and opinions on it.
The Issue
While exploring various packages, I’ve noticed that directories like test, docs, examples, and notebooks often contain large files (images, binaries, etc.) that are not directly necessary for the functionality of the package itself. These files can significantly increase the total disk space used by installed packages, which might be an issue for users with limited storage or those who prefer a lean installation.
The Proposal
What if package authors could specify which directories are essential for their package’s operation, and only those directories would be included by default when users install the package? This could potentially reduce the installation size of packages by excluding non-essential directories (such as docs and examples) unless explicitly requested by the user.
The idea is not to limit the availability of these resources but to give users the option to install a minimal version of the package, focusing on the essentials needed for its operation. Additional resources could still be accessed by those interested, for example, by cloning the repository or opting in through the package manager.
Examples
Packages such as GR.jl and ColorSchemes.jl include substantial non-source files within their docs, examples, or test directories. While important for development purposes, these files may not be needed by all users, particularly those focused on using the packages’ functionalities rather than modifying or extending them.
Benefits
Reduced Disk Space Usage: This approach allows for more efficient use of disk space, catering to users with limited storage or those who prefer a minimal installation footprint.
Enhanced Installation Efficiency: Streamlining the number of files to download can lead to quicker installation times and a more user-friendly experience.
Your Thoughts?
I’m curious about your thoughts on this idea. Do you think it’s a feasible enhancement for the Julia package ecosystem? Would it be beneficial for both package authors and users? Are there potential downsides or challenges that might arise from implementing such a feature?
Technically this is correct, but in practice we are talking about… megabytes, I guess. Sure, saving space is always nice, but is it worth the complication in practice?
I don’t understand why you are phrasing this as a benefit to package authors, when it may just mean having to fix something that worked before. E.g., if a package keeps some binary tables in data/, not in src/, it would have to specify that explicitly in order to continue working.
Looking at ~/.julia/packages, I notice large unnecessary files in popular packages, for example:
test/testfiles in CSV, 28 MB
docs/assets, assets, RPRMakie/examples in Makie, 81 MB
examples in GR, 25 MB
docs in ColorSchemes, 19 MB
test/results in Sobol, 18 MB
And even more packages with files of ~10 MB or less.
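For anyone who wants to check their own depot, here is a rough sketch of how such numbers could be collected. The list of "non-essential" directory names is my assumption, and `dirsize` and `nonessential_report` are made-up helper names, not part of Pkg.

```julia
# Sum the sizes of all regular files under `dir` (0 if the directory does not exist).
function dirsize(dir::AbstractString)
    isdir(dir) || return 0
    total = 0
    for (root, _, files) in walkdir(dir)
        for f in files
            path = joinpath(root, f)
            islink(path) && continue  # skip symlinks, which may be broken
            total += filesize(path)
        end
    end
    return total
end

# For each installed package version, count the bytes stored in directories
# that are usually not needed at runtime. The default list is an assumption.
function nonessential_report(packages_dir::AbstractString;
                             dirs = ["test", "docs", "examples", "notebooks"])
    report = Pair{String,Int}[]
    isdir(packages_dir) || return report
    for pkg in readdir(packages_dir; join = true)
        isdir(pkg) || continue
        for version in readdir(pkg; join = true)
            isdir(version) || continue
            bytes = sum(dirsize(joinpath(version, d)) for d in dirs; init = 0)
            bytes > 0 && push!(report, relpath(version, packages_dir) => bytes)
        end
    end
    # Largest offenders first
    return sort!(report; by = last, rev = true)
end
```

Calling `nonessential_report(joinpath(first(DEPOT_PATH), "packages"))` should list the installed versions with the most space in those directories first.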
These sizes add up for each installed version of each package. Still smaller than compiled/, but would be nice to have an easy way to avoid storing these files everywhere.
Also, to avoid downloading them in the first place.
Especially nice if this works for packages from both the pkgserver and git repos.
My immediate thought was whether this could help reduce the size of compiled binaries (e.g. via PackageCompiler.jl) when a utilized package pulls in a lot of extraneous docs/testdata/etc, but I’m fairly certain those binaries wouldn’t include any of that anyway.
I could imagine this being useful for a package that for some reason wants to include a large quantity of optional content and wants a flag to indicate that a user has opted in to the full download… but you could currently accomplish this with git branches, e.g. add MyPackage vs add MyPackage#thewholeenchilada. That puts some extra work on the package maintainer to maintain parallel branches, but adding this to the package system likewise adds complexity to the whole package system.
I would like this feature. I have a package that has ~300 kB of functional stuff vs. 170 MB of data that is required for testing. Yes, I could store the data somewhere else, but that would add quite a lot of complexity to the development workflow.
It could be a line in Project.toml: essential_dirs = ["src"].
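To make the idea concrete, here is a minimal sketch of how a tool might read such a key and decide what a lean install would keep. Note that `essential_dirs` is a hypothetical key that Pkg does not recognize today, and `lean_split` is a made-up helper; falling back to `["src"]` when the key is absent is also just my assumption.

```julia
using TOML

# Return the directories a package declares as essential.
# `essential_dirs` is a HYPOTHETICAL Project.toml key, not something Pkg reads.
function essential_dirs(project_file::AbstractString; default = ["src"])
    project = TOML.parsefile(project_file)
    dirs = get(project, "essential_dirs", default)
    return String[string(d) for d in dirs]
end

# Split the top-level entries of a package directory into those a lean
# install would keep and those it would drop. Top-level files are always kept.
function lean_split(pkgdir::AbstractString)
    keep = essential_dirs(joinpath(pkgdir, "Project.toml"))
    kept, dropped = String[], String[]
    for entry in readdir(pkgdir)
        if isdir(joinpath(pkgdir, entry)) && !(entry in keep)
            push!(dropped, entry)
        else
            push!(kept, entry)
        end
    end
    return kept, dropped
end
```

With a default like this, a package that loads files from data/ at runtime would have to list that directory explicitly, which is exactly the concern raised earlier in the thread.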
I think for the most part the responsibility sits with package authors to not commit those files in the first place. Perhaps one of the automatic checks during the package registration process could warn about large files.
I agree that maybe in the longer term this might be a good feature to think about: lean vs. complete package installations. I also had the thought that it could go in the Project.toml file (even before @lmiq edited his post to say that).
By the way, if someone wants to explain how one can develop a package, its docs, and its tests, with some files in one branch and other files in another branch, without having to clone the package repo multiple times or maintain multiple repositories, I’m open to suggestions.
Well, yes, that does complicate the developer’s life. One alternative could be a default artifacts directory that is not downloaded upon installation; then we could just put the heavy data there.
A package download from a package server is one single file, regardless of size. Moreover it is indexed by a content hash, which is the same as is stored in the registry, so packages are by design indivisible units. You could discuss filtering out directories when unpacking the package archive, but that doesn’t come without its own share of drawbacks.
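As a rough illustration of filtering while unpacking, the Tar standard library accepts a predicate in both `Tar.extract` and `Tar.tree_hash`. This is only a sketch, not how Pkg works: the skip list and the helper names are assumptions, and it operates on an uncompressed tarball.

```julia
using Tar

# Directories to leave out; this list is an assumption for illustration.
const SKIP_DIRS = ["docs", "test", "examples"]

# Predicate for Tar: keep an entry unless its top-level directory is skipped.
keep_entry(hdr::Tar.Header) = !(first(splitpath(hdr.path)) in SKIP_DIRS)

# Unpack an uncompressed tarball into `dest` without the skipped directories.
extract_lean(tarball::AbstractString, dest::AbstractString) =
    Tar.extract(keep_entry, tarball, dest)

# Git tree hashes of the full content and of the filtered content.
tree_hashes(tarball::AbstractString) =
    (full = Tar.tree_hash(tarball), lean = Tar.tree_hash(keep_entry, tarball))
```

One of the drawbacks shows up right away: the filtered tree has a different tree hash than the full archive, so what ends up on disk no longer corresponds to the hash recorded in the registry.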
There are primarily two options to store things in the repository without being included in the package:
Place them in some separate branch which is not used by package releases.
Place the package in a subdirectory of the repository and additional stuff outside of the subdirectory.
Regardless of how you organize it, it’s still not a great idea to put large data in the repository, since this will hurt the developers with slow git clones and large disk usage. The only sound approach is to have large data in external sources and download it on demand.
I find the artifact system amazing, particularly for files shared among projects. But for my use case (a bunch of ~10 MB files required for testing the package), I don’t really see a cost/benefit advantage in adding that complication. At the very least I would have to disentangle the data structure of my tests from the data files, and that can get quite messy. And the test files can get updated when I update the package, so I would have to separately update all artifacts…
This is the first time that I find some compelling reason not to store this data in the package repo. But I need a more compelling reason to dive into the artifact complications.
However, it’s more complicated than that. There are not only images for logos but also sometimes notebooks or data. It’s not intuitive for developers to put them in other branches, since they may not be found by users, not to mention that these packages often have many branches. And, if that works, we have to persuade some developers in the Julia community to follow this convention, which seems harder than changing the Pkg.jl behaviors.
Having essential and non-essential directories in a repo can lead to testing and installation issues. For example, it is easy to accidentally add an essential file to a non-essential directory. Then when running CI everything looks fine, but when users install the package it doesn’t work. This happened to me when I added a file to ensure zarr-python could read files from Zarr.jl but didn’t understand the details of how Python packaging ignores some files (see "Run Zarr CI tests out-of-tree", Issue #1347 in zarr-developers/zarr-python on GitHub).
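One way CI could guard against this, sketched under the assumption that the author maintains a list of files the package needs at runtime: copy only the essential entries into a scratch directory and check that nothing required is missing. `lean_copy` and `missing_files` are made-up helper names, and the default `essential` list is only an example.

```julia
# Copy only the essential entries of `pkgdir` into a fresh temporary directory,
# mimicking what a lean install would leave on disk.
function lean_copy(pkgdir::AbstractString; essential = ["src", "Project.toml"])
    dest = mktempdir()
    for entry in essential
        src = joinpath(pkgdir, entry)
        ispath(src) && cp(src, joinpath(dest, entry))
    end
    return dest
end

# Return the required files (paths relative to the package root) that are
# absent from the lean copy.
missing_files(leandir::AbstractString, required) =
    [f for f in required if !isfile(joinpath(leandir, f))]
```

A CI job could fail whenever `missing_files` returns a non-empty list, or go further and run the test suite against the lean copy rather than the full checkout.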