Code must be battle-tested before it becomes software. How should we test the libraries that we develop? What kind of tests are sufficient?
Suppose that we are developing a package of numerical solvers (for linear algebra, optimization, PDEs, etc.). I suppose this is the case for many programmers here. I have PRIMA in mind as I write. This is a package for solving general nonlinear optimization problems without using derivatives.
One may believe it suffices to test a few problems and observe whether the results are as expected. If yes, then “the implementation is correct”. It seems that many people are happy with such a test. For me, this is a joke (sorry to say so, but please read on).
I would like to elaborate a bit more about the importance of testing and verification, motivated by a conversation with a friend who uses PRIMA in his projects.
This friend refers to his projects as “critical projects”, which are directly related to the life and death of humans. Imagine, for example, the design of a new medicine (although this is not what my friend works on). The importance of the reliability of the solver and the reproducibility of the solution cannot be overstated in such projects. This is quite different from machine learning problems whose objective is only to decide how to post advertisements, where nobody will die if the solver does something wrong. In critical projects, however, people may die.
So, what kind of tests are sufficient for PRIMA, which is designed for (and is being used in) critical projects?
No test will be sufficient. I can only say that the following tests are necessary.
- A large number (e.g., hundreds) of test problems, which represent as many as possible of the different challenges that may occur in applications.
- TOUGH tests. In applications, function evaluations may fail, they may return NaN or infinity from time to time, and they are very likely contaminated by noise. Any test is insufficient without trying such problems.
- Randomized tests. It is impossible to cover all the possible difficulties with a fixed set of problems. Some bugs can only be triggered under very particular conditions that are “difficult” to encounter without randomization (a bug that is rarely triggered is still a bug!!!). Therefore, tests must be randomized, and the random seed must be changed periodically (daily or weekly).
- Stress tests. If a solver is designed to solve 100-dimensional problems, then we must test it on (randomized) 1000-dimensional problems and make sure that it does not crash.
- Automated tests. It is not enough to randomize the tests. Randomized tests must also be executed automatically every day and night, for example, using GitHub Actions.
- Tests on various platforms under different systems using all compilers/interpreters available. Our software should not crash on any platform. Without thorough tests, the only thing I know is that we do not know what will happen.
- A sufficiently long time of testing. In general, I do not feel confident about a solver if the accumulated testing time is below 10 years.
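To make the TOUGH and randomized items in the list above concrete, here is a minimal Python sketch. It is not PRIMA's test suite: `toy_solver` is a hypothetical stand-in for the solver under test, and the names `weekly_seed`, `make_tough`, and `run_randomized_test`, as well as the noise level and failure rates, are assumptions chosen for illustration. The seed is derived from the ISO year and week, so it changes weekly while any failure can still be reproduced from the seed.

```python
import datetime
import math
import random


def weekly_seed(today=None):
    """Derive a seed from the ISO year and week, so it changes once a week."""
    today = today or datetime.date.today()
    year, week, _ = today.isocalendar()
    return year * 100 + week


def make_tough(fun, rng, noise=1e-3, nan_rate=0.05, fail_rate=0.05):
    """Wrap `fun` so that it is noisy and sometimes returns NaN or raises."""
    def tough(x):
        r = rng.random()
        if r < fail_rate:
            raise RuntimeError("simulated evaluation failure")
        if r < fail_rate + nan_rate:
            return math.nan
        # Multiplicative noise on an otherwise exact evaluation.
        return fun(x) * (1.0 + noise * rng.gauss(0.0, 1.0))
    return tough


def toy_solver(fun, x0, rng, max_eval=500, step=0.5):
    """Hypothetical stand-in solver: random local search tolerating bad values."""
    def safe(x):
        # Treat failures, NaN, and infinity as "infinitely bad" values.
        try:
            f = fun(x)
        except Exception:
            return math.inf
        return f if math.isfinite(f) else math.inf

    x, fx = list(x0), safe(x0)
    for _ in range(max_eval):
        y = [xi + step * rng.gauss(0.0, 1.0) for xi in x]
        fy = safe(y)
        if fy < fx:
            x, fx = y, fy
    return x, fx


def run_randomized_test(seed, n_problems=20, dim=5):
    """Run the solver on randomly shifted quadratics; return the final points."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_problems):
        shift = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

        def sphere(x, shift=shift):
            return sum((xi - si) ** 2 for xi, si in zip(x, shift))

        x, _ = toy_solver(make_tough(sphere, rng), [0.0] * dim, rng)
        results.append(x)
    return results
```

A scheduled job would call `run_randomized_test(weekly_seed())` and log the seed, so that any crash or non-finite result can be replayed with exactly the same problems.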
Comparing the list above with “testing a few problems and observing whether the result is expected”, I hope it is clear what I meant by “it is a joke” (sorry again for saying so). Recalling that human lives may depend on the projects that the solvers serve, I guess it is clear why such a joke is not enough.
My experiences in past years have taught me three things.
- I do not know what will happen in a particular case until I have made sufficiently many tests about it.
- When I believe a test is stupid and unnecessary, the test will show me later that I am the stupid one.
- When I believe that I know numerical computation and my code well enough, some tests will show me that I don’t.
PRIMA has been tested in this way for more than 20 years, summing up the testing time of all the parallel tests. I insist that any porting/translation of PRIMA should go through the same level of tests. Otherwise, we cannot be sure whether it is proper.
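One way to start testing a port, sketched here under my own assumptions rather than taken from PRIMA's actual procedure, is to run the port and the reference implementation on the same randomized problems and report every disagreement. In the Python sketch below, `reference_solver` and `ported_solver` are hypothetical stand-ins (a simple coordinate search, with the "port" just calling the reference), and `compare_on_random_problems` and its tolerance `tol` are illustrative names.

```python
import random


def reference_solver(fun, x0, max_iter=200):
    """Hypothetical stand-in for the reference code: a coordinate search."""
    x, fx, step = list(x0), fun(x0), 1.0
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                y = list(x)
                y[i] += d
                fy = fun(y)
                if fy < fx:
                    x, fx, improved = y, fy, True
        if not improved:
            step *= 0.5  # Shrink the step when no trial point improves.
    return x, fx


def ported_solver(fun, x0, max_iter=200):
    """Hypothetical stand-in for the port; here it simply calls the reference."""
    return reference_solver(fun, x0, max_iter)


def compare_on_random_problems(seed, n_problems=50, dim=4, tol=0.0):
    """Return (seed, problem index) for every problem where the two disagree."""
    rng = random.Random(seed)
    mismatches = []
    for k in range(n_problems):
        center = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        x0 = [rng.uniform(-5.0, 5.0) for _ in range(dim)]

        def fun(x, center=center):
            return sum((xi - ci) ** 2 for xi, ci in zip(x, center))

        x_ref, f_ref = reference_solver(fun, x0)
        x_new, f_new = ported_solver(fun, x0)
        differs = abs(f_ref - f_new) > tol or any(
            abs(a - b) > tol for a, b in zip(x_ref, x_new)
        )
        if differs:
            mismatches.append((seed, k))
    return mismatches
```

Recording the seed together with the problem index makes each mismatch reproducible. Such a comparison only complements the tests listed earlier; it does not replace the tough, stress, and multi-platform tests.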
I put testing and verification at the very center of development. (Indeed, I feel that many libraries, if not most, have not been sufficiently tested.) Today (20240108), I received the following comment:
thank you for modernizing Powell’s solvers and taking verification serious. This is such important work!
I am delighted that my efforts in testing are appreciated. (You should check the cartoon).
How do you test the libraries that you develop?
[This is a copy of What kind of tests are sufficient for the porting or translation of PRIMA? · libprima · Discussion #39 · GitHub with slight adaptations.]