Computational reproducibility of Jupyter notebooks from biomedical publications

Shamelessly copied from the Research Computing Teams newsletter. Relevant as Julia should help greatly with reproducibility.

Computational reproducibility of Jupyter notebooks from biomedical publications - Sheeba Samuel & Daniel Mietchen

This is a fun paper, which required a really impressive automated pipeline to do, and I think people are drawing the wrong conclusions from it.

Out of 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 publications, 22,578 notebooks were written in Python, including 15,817 that had their dependencies declared in standard requirement files and that we attempted to rerun automatically. For 10,388 of these, all declared dependencies could be installed successfully, and we reran them to assess reproducibility. Of these, 1,203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones.

This is pretty cool! Over 22.5k notebooks from almost 3.5k papers were tested, to which, well, hats off. Super cool work.

There’s a lot of focus in discussions of this work on the fact that, of 22.5k notebooks, only 1.2k ran. I don’t think that’s anything like the biggest problem uncovered here.

For a one-off research paper, the purpose of publishing a notebook is not so that someone else can click “run all cells” and get the answer out. That’s a nice-to-have. It means that researchers that choose to follow up have an already-working starting point, which is awesome, but that’s an efficiency win for a specific group of people, not a win for scientific reproducibility as a whole.

There are people saying that we should have higher standards, and that groups who publish notebooks should put more person-time into making sure the notebooks reliably run on different installs/systems/etc. That’s a mistake. For the purposes of advancing science, every person-hour that would go into testing deployments of notebooks on a variety of systems would be better spent improving the paper’s methods section and code documentation, making sure everything is listed in the requirements.txt, and then going on to the next project.

The primary purpose of publishing code is as a form of documentation, so that other researchers can independently reproduce the work if needed. But we know for a fact that most code lives a short, lonely existence (#11). Most published research software stops being updated very quickly (#172) because the field doesn’t need it. And trying to reimplement 255 machine learning papers showed that a clearly written paper and responsive authors were much more significant factors for independent replication than published source code (#12). If others are really interested in getting a notebook to run again, then presumably the problems will get fixed. The fraction of those notebooks that will be seriously revisited, however, is tiny.

To me, the fact that 6,761 notebooks didn’t declare their dependencies is a problem because it represents insufficient documentation. That 324 notebooks ran and gave different results is a real problem, because it means there was some state somewhere which wasn’t fixed (again, an issue of documentation). That 5,429 notebooks could no longer have all their declared dependencies installed isn’t, to me, a problem much worth fixing, nor is (necessarily) the fact that 9,185 notebooks installed everything but didn’t run successfully (it depends on why).
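As a quick sanity check on those counts, here is a minimal Python sketch that rebuilds the funnel from the figures in the abstract quoted above. The variable names and stage labels are purely illustrative and are not taken from the paper.

```python
# Counts copied from the abstract quoted above.
python_nbs = 22_578      # notebooks written in Python
declared_deps = 15_817   # had dependencies declared in standard requirement files
deps_installed = 10_388  # all declared dependencies installed successfully
ran_clean = 1_203        # ran through without any errors
identical = 879          # results identical to the original notebook
different = 324          # results differed from the original notebook

# Each Python notebook ends up in exactly one of these stages
# (labels are illustrative, not the paper's terminology).
funnel = {
    "no declared dependencies": python_nbs - declared_deps,
    "dependencies failed to install": declared_deps - deps_installed,
    "installed but errored": deps_installed - ran_clean,
    "ran, results differed": different,
    "ran, results identical": identical,
}

# The stages should partition the Python notebooks exactly.
assert identical + different == ran_clean
assert sum(funnel.values()) == python_nbs

for label, count in funnel.items():
    share = count / python_nbs
    print(f"{label:32s}{count:>7,d}  ({share:5.1%} of Python notebooks)")
```

The assertions only check that the quoted numbers are internally consistent; they say nothing about why any individual notebook failed.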

Less controversially, this has been out for a while but I hadn’t noticed - Polyhedron’s plusFORT v8 is free for educational and academic research. It’s a very nice refactoring & analysis tool that even works well with pre-Fortran90 code. Most tools like VSCode or Eclipse shrug their shoulders and give up on older or non-F90 Fortran code, even though that’s the stuff that generally needs the most work. If you try this, share with the community how well it works.

The value of using open-source software broadly is something like $8.8 trillion, but it would “only” cost something like $4.15 billion to recreate if it didn’t exist, according to an interesting paper. That roughly 2000x notional return on notional investment suggests the incredible leverage of open source software. Also, only 5% of projects/developers produce 95% of that value, something that would likely be seen in research software as well.
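The "2000x" figure is just the ratio of the two estimates quoted above. A tiny sketch of the arithmetic, with illustrative variable names:

```python
# Both figures are the rough estimates quoted above, in US dollars.
usage_value = 8.8e12     # estimated value of using open-source software
recreate_cost = 4.15e9   # estimated cost to recreate it from scratch

ratio = usage_value / recreate_cost
print(f"notional return: about {ratio:,.0f}x")
```

Given how rough both inputs are, rounding the result to "2000x" seems fair.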


Many thanks for this interesting reference! Did the paper mention (the number of) Julia notebooks? (I understand they did only Python execution testing).

It could be interesting to do the same for Julia - but somebody has got to invest the time to do it …

Thank you @jdad. I did think I should write an article for Jonathan’s RCT Newsletter on how Julia helps with reproducible science.

That needs a sharp stick applied to my posterior.