Reproducibility: What's the risk of a dependency becoming unavailable?

Hi all

I’m talking a lot about reproducibility of computational results in my role as Data Editor of an economics journal. I’m aware that Julia is better positioned than most (if not all) languages on that front, and I’m trying to tell people as much. There is one nagging question that comes back to me from time to time, and I’m looking for a good answer. Again, I am not trying to advocate any change in how Julia packages work at the moment (god no), just looking for a better answer.

Q: So Pkg is great and all, but most dependencies are decentralized over the internet. So what happens the day Author X of package Y decides to take their package down? If you don’t have a copy of that repo somewhere, your project becomes unreproducible.

The background of this question is usually that other languages (R, mostly) have a central repository where things get backed up, at least to some extent, whereas in the Julia world we don’t (at least as far as I know).

My answer is something along these lines:

  • the only real guarantee in the open-source world is to distribute all source code with the replication package, i.e. include all dependencies, basically providing an offline version of the project;
  • provide a docker container with all packages pre-installed (very similar to the first answer);
  • of course, both come at a considerable cost in user-friendliness, and Pkg is much better.
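For what it’s worth, a checked-in Manifest.toml already pins the exact versions and content hashes of every dependency, so the “offline version” mostly amounts to shipping a package depot alongside the project. A minimal sketch of what such an archive could look like (directory and file names are made up for illustration; the commented command uses Julia’s real `JULIA_DEPOT_PATH` mechanism):

```shell
# Hypothetical layout for a self-contained replication archive:
# the project environment plus a depot holding all package sources.
mkdir -p replication/project replication/depot

# At run time you would point Julia at the bundled depot so that
# nothing needs to be fetched from the internet, e.g.:
#   JULIA_DEPOT_PATH=$PWD/replication/depot julia --project=replication/project run.jl

ls replication
```

Populating the depot is the part a tool has to automate; the point is just that once it exists, reproduction no longer depends on any package server being up.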

So: has it actually ever happened that a package just disappeared and projects broke? I would have thought that’s extremely rare, but I’d be interested in some views on this.
thanks


I think Stefan’s comment here addresses your question?

https://lwn.net/Articles/874250/


It does indeed! Thanks!

Kudos for taking this seriously. A lot of journals just don’t care beyond ticking boxes. I have seen a lot of econ papers (even from the past 5 years) that

  1. have an online/technical appendix on the author’s website (which may disappear, or become out of date, especially as people move and/or rebuild their web pages),

  2. have a tarball of the code on either the journal’s webpage or a personal webpage, which is incomplete and/or no longer runs.

We’re trying! But honestly things are changing fast at the moment, so at least it’s moving in the right direction.


Is there an archive strategy for intentional removals by pkg.julialang.org?

e.g.

  1. A package/version/artifact was found to be backdoored
  2. A package/version/artifact was found to have license issues / received DMCA takedowns
  3. A package/version/artifact was found to be legally/morally poisonous in most jurisdictions for other reasons

DMCA takedowns are likely to remove artifacts from S3. Western jurisdictions have trouble touching Kazakhstan, as evidenced by Sci-Hub; more generally, The Pirate Bay also exists.

So ideally we’d seed decentralized backups, to ensure that dependencies remain “available, with some hassle and without regard for legality, for scientific forensics” forever – especially if julialang.org takes future action to make an old artifact unavailable, for example because it is legally compelled to do so.

Not directly in response to the original point by @floswald, but somewhat related: when I’m preparing the replication code for a paper, I typically use BundleProjects.jl (https://github.com/davidanthoff/BundleProjects.jl) to vendor some of the packages into the archive that gets attached to the journal publication. We often write papers where some of the core scientific code for the publication lives in a package (e.g. in a modeling paper, the core model code itself), and I’ve always felt that this kind of code should not live somewhere on a package server but actually be part of the “official” archive for the paper; that gets very easy with this package. It also makes it very simple to create replication archives for use during paper review that contain code that might not yet be public.


Very nice! But so, you do that on the last day of development, when you know that you will no longer be `] up`-ing anything and you want to freeze the library into what is in the paper?

Yes, ideally we do that when we’ve just finished the final run that produces the results that we actually show in the paper, and then we zip up that folder, upload it to zenodo and include the DOI to that archive in the manuscript that gets published. Here is an example where we did that.
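In case it’s useful to anyone, that “zip up the folder” step is trivially scriptable; a sketch using `tar` instead of zip (the folder and file names here are made up for illustration):

```shell
# Hypothetical replication folder, frozen right after the final run.
mkdir -p paper-replication
printf 'name = "PaperReplication"\n' > paper-replication/Project.toml

# Archive the whole folder -- including Project.toml and Manifest.toml --
# ready for upload to Zenodo.
tar czf replication-archive.tar.gz paper-replication/
```

The resulting tarball is what gets the DOI; anyone unpacking it later has the exact environment files the final run used.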

The way I’ve used it during the review process is that I create a bundled version of the code when we submit the initial manuscript or a revision, zip things up, put it on Google Drive and create an unlisted link. I then include that link in the manuscript’s “Code and Data” section, and in brackets I write something like “[This link will be replaced with a permanent DOI link at the time of publication]”. That way reviewers actually have access to the replication code while reviewing the paper. That whole last part is probably silly, as I very much doubt that any reviewer ever looked at any code, but at least in theory they could 🙂


Yes, indeed, that is where I want to end up one day: all reviewers having access to code. Currently it’s like saying “proofs in Appendix A” and then not showing Appendix A.
