Downloading the data is not the problem; referencing the data across library installations is the problem. This is gonna dip into domain-specific information, but I think there's enough in common for me to continue.
tl;dr DataDeps seems to address the problem I describe below (thanks @giordano). I still believe that "files" belong in the package repository. Continue reading if you want to know what I mean.
I am not an astrophysicist, but I work with them, helping them build better software for GNC (Guidance, Navigation, and Control) of spacecraft. SPICE is a software toolkit for many things; in this case, calculating geometry for mission planning, operations, and analysis.
NASA provides what the toolkit calls "kernels", best described by the project's own extensive documentation as:

> "Kernel" means a file containing "low level" ancillary data that may be used, along with other data and SPICE Toolkit software, to determine higher level observation geometry parameters of use to scientists and engineers in planning and carrying out space missions, and analyzing data returned from missions.
I suspect this toolkit was created in The Before Times, when general-purpose computers did not have (nearly) infinite resources for calculating planetary geometry, let alone cool high-level languages like Julia to do it! Regardless, there is a large academic and professional community that knows how to use SPICE and depends on it for their results.
There are literally thousands of kernels available to the public from a NASA-maintained database, searchable from a rather old-school-looking website. For example, here are all the kernels (and future kernels, because this mission is not over) for the OSIRIS-REx spacecraft mission data.
These kernels can run into the hundreds of megabytes in size, with published quantities in the thousands. This makes it problematic to "store them all" in our own project, so to speak. NASA employs people to publish new kernels as new spacecraft enter orbit.
So…rather than thinking of this binary information as a file download, I'm thinking of it more like program data on which source code depends. It's not really a shared library like the OpenSSL bug that I commented about not long ago. I believe that if SPICE were rewritten today, this information would be stored in some special-purpose database and our Julia code would just…open a database network driver and ask for it. (Side note: someone should build this database; that sounds useful.)
whew, still with me?
In the meantime, we have git LFS. It can store references to these "kernels" and allow us to commit them to source control just like any other file, because they are files. Our source code can reference these files by path. Yea! But the size of the objects is far greater than any reasonable person would want to include in the project history.
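For reference, the LFS setup is just a few commands; the file extensions below are the common SPICE kernel types, so adjust for whatever your mission's archive actually publishes:

```shell
# Tell git LFS to manage SPICE kernel file types instead of
# storing their contents directly in the repository history.
git lfs install
git lfs track "*.bsp"   # SPK ephemeris kernels
git lfs track "*.tls"   # leapseconds kernels
git lfs track "*.tpc"   # planetary constants kernels

# The tracking rules live in .gitattributes, which is committed normally;
# the kernel files themselves become small pointer files in git history.
git add .gitattributes
```

The repository history then only carries small pointer files, but anyone cloning still needs an LFS remote that can serve the real objects, which is exactly the hosting-cost problem above.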
Since Julia uses git as the primary (exclusive?) interface for package management, it would make sense if…a SPICE.jl package managed these large objects, so the user doesn't have to. In some cases the user is a computer program running a CI pipeline with no human intervention.
Git LFS might not be the best solution, but it's one we have. Downloading kernels with custom code from a URL for every run (especially in CI) can be expensive for already-expensive (hours-long) simulation runs. In the meantime I'll check out DataDeps and see if it solves this problem. Maybe I'll also work on a microservice setup between the SPICE functions that load big kernels and a network database that spits out results.
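From a first look at the DataDeps docs, registering a kernel might look roughly like this. To be clear, the dependency name, URL, and filename here are placeholders I made up, not a real archive entry; a real setup would point at the NAIF archive and add a post-fetch checksum:

```julia
using DataDeps

# Hypothetical registration of a single ephemeris kernel.
# The URL and filename are placeholders for illustration only.
register(DataDep(
    "ExampleKernel",
    "An example SPICE SPK kernel, fetched once and cached locally",
    "https://example.com/kernels/example.bsp",
))

# First access triggers the download; subsequent runs (and a cached
# CI environment) reuse the local copy instead of re-downloading.
kernel_path = joinpath(datadep"ExampleKernel", "example.bsp")
```

If that works as advertised, the expensive download happens once per machine rather than once per simulation run, which addresses the CI cost concern without putting the kernels in git history at all.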