Computational reproducibility of Jupyter notebooks from biomedical publications

Shamelessly copied from the Research Computing Teams newsletter. Relevant as Julia should help greatly with reproducibility.

Computational reproducibility of Jupyter notebooks from biomedical publications - Sheeba Samuel & Daniel Mietchen

This is a fun paper, which required a really impressive automated pipeline to do, and I think people are drawing the wrong conclusions from it.

Out of 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 publications, 22,578 notebooks were written in Python, including 15,817 that had their dependencies declared in standard requirement files and that we attempted to rerun automatically. For 10,388 of these, all declared dependencies could be installed successfully, and we reran them to assess reproducibility. Of these, 1,203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones.

This is pretty cool! Over 22.5k notebooks from almost 3.5k papers were tested, to which, well, hats off. Super cool work.

There’s a lot of focus in discussions of this work on the fact that of those 22.5k notebooks, only 1.2k ran. I don’t think that’s anything like the biggest problem uncovered here.
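For context, here is where the rest of those notebooks fell out of the pipeline. These counts are just arithmetic on the figures quoted from the abstract above — a back-of-the-envelope check, not output from the paper’s pipeline:

```julia
# Counts quoted from the paper's abstract above
python_notebooks   = 22_578   # Python notebooks found
with_requirements  = 15_817   # dependencies declared in standard requirement files
deps_installed     = 10_388   # all declared dependencies installed successfully
ran_without_errors =  1_203   # ran through without any errors
identical_results  =    879   # reproduced the reported results exactly

# Where the rest fell out of the pipeline
no_requirements   = python_notebooks - with_requirements    # 6,761
install_failed    = with_requirements - deps_installed      # 5,429
errored_on_rerun  = deps_installed - ran_without_errors     # 9,185
different_results = ran_without_errors - identical_results  #   324

println((; no_requirements, install_failed, errored_on_rerun, different_results))
```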

For a one-off research paper, the purpose of publishing a notebook is not so that someone else can click “run all cells” and get the answer out. That’s a nice-to-have. It means that researchers that choose to follow up have an already-working starting point, which is awesome, but that’s an efficiency win for a specific group of people, not a win for scientific reproducibility as a whole.

There are people saying that we should have higher standards, and that groups who publish notebooks should put more person-time into making sure the notebooks reliably run on different installs/systems/etc. That’s a mistake. For the purposes of advancing science, every person-hour that would go into testing notebook deployments on a variety of systems would be better spent improving the paper’s methods section and code documentation, making sure everything is listed in the requirements.txt, and then moving on to the next project.
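For example, pinning even a short requirements.txt takes minutes and records exactly what the notebook was run against; the packages and versions below are purely illustrative:

```text
# requirements.txt -- hypothetical example, with versions pinned
numpy==1.24.3
pandas==2.0.1
matplotlib==3.7.1
scikit-learn==1.2.2
```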

The primary purpose of publishing code is as a form of documentation, so that other researchers can independently reproduce the work if needed. But we know for a fact that most code lives a short, lonely existence (#11). Most published research software stops being updated very quickly (#172) because the field doesn’t need it. And trying to reimplement 255 machine learning papers showed that a clearly written paper and responsive authors were much more significant factors for independent replication than published source code (#12). If others are really interested in getting a notebook to run again, then presumably the problems will get reported and fixed. The fraction of those notebooks that will be seriously revisited, however, is tiny.

To me, the fact that 6,761 notebooks didn’t declare their dependencies is a problem, because it represents insufficient documentation. That 324 notebooks ran and gave the wrong answers is a real problem, because it means there was some state somewhere which wasn’t fixed (again, an issue of documentation). That 5,429 notebooks could no longer have all their declared software installed isn’t, to me, a problem much worth fixing, nor is it (necessarily) a problem that 9,185 notebooks installed everything but didn’t run successfully (it depends on why).
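As a language-agnostic illustration of that kind of unfixed state (my example, not one from the paper): a notebook cell that draws random numbers without seeding the generator will give a different “result” on every rerun, even though the code itself is fine. Sketched here in Julia:

```julia
using Random, Statistics

# Unseeded: each rerun of the notebook gives a different answer,
# so a reproduced run won't match the published output.
unstable = mean(randn(1_000))

# Seeded: the state is fixed (and documented), so reruns match.
Random.seed!(42)
stable = mean(randn(1_000))

println((; unstable, stable))
```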


Less controversially, this has been out for a while but I hadn’t noticed - Polyhedron’s plusFORT v8 is free for educational and academic research. It’s a very nice refactoring & analysis tool that even works well with pre-Fortran90 code. Most tools like VSCode or Eclipse shrug their shoulders and give up on older or non-F90 Fortran code, even though that’s the stuff that generally needs the most work. If you try this, share with the community how well it works.


The value of using open-source software broadly is something like $8.8 trillion, but it would “only” cost something like $4.15 billion to recreate if it didn’t exist, according to an interesting paper. That roughly 2,000x notional return on notional investment suggests the incredible leverage of open-source software. Also, only 5% of projects/developers produce 95% of that value, something that would likely be seen in research software as well.



Many thanks for this interesting reference! Did the paper mention (the number of) Julia notebooks? (I understand they did execution testing only for Python.)

It could be interesting to do the same for Julia - but somebody has got to invest the time to do it …

Thank you @jdad. I have been thinking I should write an article for Jonathan’s RCT Newsletter on how Julia helps with reproducible science.

That needs a sharp stick applied to my posterior.
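In the meantime, a minimal sketch of the mechanism such an article would lean on: committing a Project.toml and Manifest.toml alongside a notebook pins the full dependency graph to exact versions, and anyone can recreate that environment with a couple of Pkg commands (the package named below is just a placeholder):

```julia
# One-time setup by the author, in the repository root:
#   julia> using Pkg
#   julia> Pkg.activate(".")        # create/activate a project-local environment
#   julia> Pkg.add("DataFrames")    # placeholder dependency; recorded in Project.toml
#                                   # and pinned, with its whole tree, in Manifest.toml
# Commit Project.toml and Manifest.toml alongside the notebook.

# Anyone reproducing the work later:
using Pkg
Pkg.activate(".")     # use the committed environment
Pkg.instantiate()     # install the exact versions recorded in Manifest.toml
```

Together with noting which Julia version was used, that covers the “declared dependencies” part of the pipeline in the paper above and gives a rerun a fighting chance.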
