Git LFS objects could be installed with a package

There’s an open issue in Pkg.jl about some complexities when installing package dependencies that contain git LFS metadata for large objects. My use case is to require public NASA planetary data in a library that is a dependency in a number of orbital simulation projects that are private to my employer.

I worked around the issue (which I documented in a comment), but I think there could be a better solution than the one proposed. I just don’t know what it is yet. I’d like to open a discussion about possible improvements for users of packages that depend on large objects. I feel like this use case lies outside of what BinaryBuilder provides, but I’m open to a dialog.

Is the data in tarball format? Then use [ArtifactUtils.jl](https://github.com/simeonschaub/ArtifactUtils.jl) to fetch it; otherwise look into [DataDeps.jl](https://github.com/oxinabox/DataDeps.jl). Either way, the data doesn’t belong in the package repository itself.
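For reference, a minimal sketch of the ArtifactUtils.jl flow, assuming your kernels are bundled as a tarball hosted somewhere; the name and URL below are placeholders, not real data:

```julia
using ArtifactUtils, Artifacts

# Adds (or updates) an entry in Artifacts.toml, downloading the tarball
# once to compute its content hash. Placeholder name and URL.
add_artifact!(
    "Artifacts.toml",
    "example_kernels",
    "https://example.com/kernels.tar.gz";
    force = true,
)
```

After that, `artifact"example_kernels"` (from the Artifacts stdlib) resolves to the local path of the extracted tarball wherever the package is installed.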

Downloading the data is not the problem; referencing the data across library installations is the problem. This is gonna dip into domain-specific information, but I think there’s enough in common for me to continue.

tl;dr: DataDeps seems to address the problem I describe below (thanks @giordano). I still believe that “files” belong in the package repository. Continue reading if you want to know what I mean.

I am not an astrophysicist, but I work with them, helping them build better software for GNC (Guidance, Navigation, and Control) of spacecraft. SPICE is a software toolkit for many things; in this case, calculating geometry for mission planning, operations, and analysis.

NASA provides what the toolkit calls “kernels”, best described by the project’s own extensive documentation:

> “Kernel” means a file containing “low level” ancillary data that may be used, along with other data and SPICE Toolkit software, to determine higher level observation geometry parameters of use to scientists and engineers in planning and carrying out space missions, and analyzing data returned from missions.

I suspect this toolkit was created in The Before Times, when general-purpose computers did not have (nearly) infinite resources for calculating planetary geometry, let alone cool high-level languages like Julia to do it with! Regardless, there is a large academic and professional community that knows how to use SPICE and depends on it for their results.

There are literally thousands of kernels available to the public from a NASA-maintained database, searchable from a rather old-school-looking website. For example, here are all the kernels (and future kernels, because the mission is not over) for the OSIRIS-REx spacecraft mission.

Individual kernels can run into the hundreds of megabytes, and they have been published in the thousands. This makes it problematic to “store them all” in our own project, so to speak. NASA employs people to publish new kernels as new spacecraft enter orbit.

So… rather than thinking of this binary information as a file download, I’m thinking of it more like program data that the source code depends on. It’s not really a shared library, like that OpenSSL bug I commented about not long ago. I believe that if SPICE were rewritten today, this information would live in some special-purpose database and our Julia code would just… open a connection and ask for it. (Side note: someone should build that database. It sounds useful.)

whew, still with me?

In the meantime, we have git LFS. It can store references to these “kernels” and allow us to commit them to source control just like any other file, because they are files. Our source code can then reference them by path. Yay! But the objects are far larger than any reasonable person would want to include in the project history.

Since Julia uses git as the primary (exclusive?) interface for package management, it would make sense if… a SPICE.jl package managed these large objects, so the user doesn’t have to. In some cases the user is a computer program running a CI pipeline with no human intervention.
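To make that concrete, here’s a rough sketch of what I’m imagining, assuming a hypothetical SPICE.jl layout that commits LFS-tracked kernels under a `kernels/` directory (the file name is a placeholder):

```julia
using SPICE  # assumes a SPICE.jl wrapper that exposes furnsh

# pkgdir resolves to wherever Pkg installed the package, so the same
# relative path works for dev checkouts and CI installs alike.
lsk = joinpath(pkgdir(SPICE), "kernels", "naif0012.tls")  # placeholder path
furnsh(lsk)
```

The point being that the kernel travels with the package install, so a CI pipeline never needs bespoke download code.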

Git LFS might not be the best solution, but it’s one we have. Downloading kernels with custom code from a URL on every run (especially in CI) can be expensive on top of already expensive (hours-long) simulation runs. In the meantime I’ll check out DataDeps and see if it solves this problem. Maybe I’ll also work on a micro-service setup between the SPICE functions that load big kernels and a network database that spits out results.
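Here’s roughly what I understand the DataDeps flow to be, as an untested sketch: the URL points at NAIF’s public generic-kernels server, and the name and description are my own placeholders.

```julia
using DataDeps, SPICE

register(DataDep(
    "NAIF_LSK",
    "NAIF leapseconds kernel (public NASA data)",
    "https://naif.jpl.nasa.gov/pub/naif/generic_kernels/lsk/naif0012.tls",
    # a real setup should pin the file's checksum here
))

# datadep"..." downloads on first use and reuses the local copy after
# that, so repeated runs (and warm CI caches) skip the download.
furnsh(datadep"NAIF_LSK/naif0012.tls")
```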

You’re looking into the wrong solution for a real problem. If you’re (understandably) concerned about saving CI cycles, the solution is to cache the data (see for example [julia-actions/cache](https://github.com/julia-actions/cache) if using artifacts), not to clog your repository. If you use artifacts, the tarballs should also be mirrored by the PkgServer (at least gzipped tarballs; I’m not sure about the other compression formats), for persistence of the data in the Julia ecosystem.

I’m not worried about clogging anything, but thank you for your concern. This video, presented by a nice person, describes this exact problem (it’s one of reproducible science) and how DataDeps solves it. I’m looking forward to pulling that into my project and giving it a try.

I still think git LFS in packages would (sort of) work the same way.

I know you aren’t worried (clogging the repository is your proposed solution); I am :slightly_smiling_face: