Reproducibility: What's the risk of a dependency becoming unavailable?

Hi all

I’m talking a lot about reproducibility of computational results in my role as Data Editor of an economics journal. I’m aware that Julia is better positioned than most (if not all) languages on that front, and I’m trying to tell people as much. There is one nagging question that comes back to me from time to time, and I’m looking for a good answer. Again, I am not trying to advocate any change in how Julia packages work at the moment (god no), just looking for a better answer.

Q: So Pkg is great and all, but most dependencies are decentralized over the internet. So what happens the day Author X of package Y decides to take their package down? If you don’t have a copy of that repo somewhere, your project becomes unreproducible.

The background of this question is usually that other languages (R, mostly) have a central repository where things get backed up, at least to some extent, whereas in the Julia world we don’t (at least as far as I know).

My answer is something along these lines:

  • the only real guarantee in the open-source world is to distribute all source code with the replication package, i.e. include all dependencies, basically providing an offline version of the project;
  • provide a docker container with all packages pre-installed (very similar to the first answer);
  • of course, both come at a considerable cost in user-friendliness, and Pkg is much better.
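For what it’s worth, a checked-in Manifest.toml already pins the exact versions and content hashes of every dependency, so the “offline version” mostly amounts to shipping a package depot alongside the project. A minimal sketch of what such an archive could look like (directory and file names are made up for illustration; the commented command uses Julia’s real `JULIA_DEPOT_PATH` mechanism):

```shell
# Hypothetical layout for a self-contained replication archive:
# the project environment plus a depot holding all package sources.
mkdir -p replication/project replication/depot

# At run time you would point Julia at the bundled depot so that
# nothing needs to be fetched from the internet, e.g.:
#   JULIA_DEPOT_PATH=$PWD/replication/depot julia --project=replication/project run.jl

ls replication
```

Populating the depot is the part a tool has to automate; the point is just that once it exists, reproduction no longer depends on any package server being up.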

So: has it actually ever happened that a package just disappeared and projects broke? I would have thought that’s extremely rare, but I’d be interested in some views on this.
thanks


I think Stefan’s comment here addresses your question?

https://lwn.net/Articles/874250/


It does indeed! Thanks!

Kudos for taking this seriously. A lot of journals just don’t care beyond ticking boxes. I have seen a lot of econ papers (even from the past 5 years) that

  1. have an online/technical appendix on the author’s website (which may disappear, or become out of date, especially as people move and/or rebuild their web pages),

  2. have a tarball of the code on either the journal’s webpage or a personal webpage, which is incomplete and/or no longer runs.

We’re trying! But honestly things are changing fast at the moment, so at least it’s moving in the right direction.


Is there an archive strategy for intentional removals by pkg.julialang.org?

e.g.

  1. A package/version/artifact was found to be backdoored
  2. A package/version/artifact was found to have license issues / received DMCA takedowns
  3. A package/version/artifact was found to be legally/morally poisonous in most jurisdictions for other reasons

DMCA takedowns are likely to remove artifacts from S3. Western jurisdictions have trouble touching Kazakhstan, as evidenced by Sci-Hub; more generally, The Pirate Bay also exists.

So ideally we’d seed decentralized backups, to ensure that dependencies remain “available, with some hassle and without regard for legality, for scientific forensics” forever – especially if julialang.org takes future action to make an old artifact unavailable, for example because it is legally compelled to do so.

Not directly in response to the original point by @floswald, but somewhat related: when I’m preparing the replication code for a paper, I typically use BundleProjects.jl (https://github.com/davidanthoff/BundleProjects.jl) to vendor some of the packages into the archive that gets attached to the journal publication. We often write papers where some of the core scientific code for the publication lives in a package (e.g. in a modeling paper, the core model code itself), and I’ve always felt that this kind of code should not live somewhere on a package server but actually be part of the “official” archive for the paper; that gets very easy with this package. It also makes it very simple to create replication archives for use during paper review that contain code that might not yet be public.


Very nice! But so, you do that on the last day of development, when you know that you will no longer be `] up`-ing anything and you want to freeze the library into what is in the paper?

Yes, ideally we do that when we’ve just finished the final run that produces the results that we actually show in the paper, and then we zip up that folder, upload it to zenodo and include the DOI to that archive in the manuscript that gets published. Here is an example where we did that.
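In case it’s useful to anyone, that “zip up the folder” step is trivially scriptable; a sketch using `tar` instead of zip (the folder and file names here are made up for illustration):

```shell
# Hypothetical replication folder, frozen right after the final run.
mkdir -p paper-replication
printf 'name = "PaperReplication"\n' > paper-replication/Project.toml

# Archive the whole folder -- including Project.toml and Manifest.toml --
# ready for upload to Zenodo.
tar czf replication-archive.tar.gz paper-replication/
```

The resulting tarball is what gets the DOI; anyone unpacking it later has the exact environment files the final run used.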

The way I’ve used it during the review process is that I create a bundled version of the code when we submit the initial manuscript or a revision, zip things up, put it on Google Drive and create an unlisted link. I then include that link in the manuscript’s “Code and Data” section, and in brackets I write something like “[This link will be replaced with a permanent DOI link at the time of publication]”. That way reviewers actually have access to the replication code while reviewing the paper. That whole last part is probably silly, as I very much doubt that any reviewer ever looked at any code, but at least in theory they could 🙂


Yes, indeed, that is where I want to end up one day: all reviewers having access to code. Currently it’s like saying “proofs in Appendix A” and then not showing Appendix A.
