What kind of tests are sufficient: Some personal thoughts

Code must be battle-tested before it becomes software. How should we test the libraries that we develop? What kind of tests are sufficient?

Suppose that we are developing a package of numerical solvers (for linear algebra, optimization, PDEs, etc.). I suppose this is the case for many programmers here. I have PRIMA in mind at the time of writing, a package for solving general nonlinear optimization problems without using derivatives.

One may believe it suffices to test a few problems and observe whether the results are expected; if yes, then “the implementation is correct”. It seems that many people are happy with such a test. For me, this is a joke (sorry to say so, but please read on).

I would like to elaborate a bit more about the importance of testing and verification, motivated by a conversation with a friend who uses PRIMA in his projects.

This friend refers to his projects as “critical projects”, projects directly related to the life and death of humans — imagine, for example, the design of a new medicine (although this is not what my friend works on). The reliability of the solver and the reproducibility of the solution cannot be overstated in such projects. This is quite different from machine learning problems whose only objective is to decide how to place advertisements — nobody will die if the solver does something wrong. In critical projects, however, people may die.

So, what kind of tests are sufficient for PRIMA, which is designed for (and is being used in) critical projects?

No test will be sufficient. I can only say that the following tests are necessary.

  1. A large number (e.g., hundreds) of test problems, representing as much as possible the different challenges that may occur in applications.

  2. TOUGH tests. In applications, function evaluations may fail, may return NaN or infinity from time to time, and are very likely contaminated by noise. Any test is insufficient without trying such problems (a sketch combining points 2–5 appears after this list).

  3. Randomized tests. It is impossible to cover all the possible difficulties with a fixed set of problems. Some bugs can only be triggered under very particular conditions that are “difficult” to encounter without randomization (a bug that is rarely triggered is still a bug!!!). Therefore, tests must be randomized, and the random seed must be changed periodically (daily or weekly).

  4. Stress tests. If a solver is designed to solve 100-dimensional problems, then we must test it on (randomized) 1000-dimensional problems and make sure that it does not crash.

  5. Automated tests. It is not enough to randomize the tests. Randomized tests must also be executed automatically every day and night, for example, using GitHub Actions.

  6. Tests on various platforms under different systems using all compilers/interpreters available. Our software should not crash on any platform. Without thorough tests, the only thing I know is that we do not know what will happen.

  7. A sufficiently long time of testing. In general, I do not feel confident about a solver if the accumulated testing time is below 10 years.
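
As a concrete illustration of points 2–5, here is a minimal sketch in Julia (the language of this forum; PRIMA itself is in Fortran) of a randomized, automatable “tough” test driver. The solver entry point solve, the wrapper tough, and the random convex test problem are placeholders made up for this example; they are not part of PRIMA’s actual test suite.

using Random, Dates

# Wrap an objective so that evaluations are noisy and occasionally fail by
# returning NaN or Inf, mimicking what happens in real applications.
function tough(f; noise = 1e-2, p_nan = 0.05, p_inf = 0.01)
    return x -> begin
        r = rand()
        r < p_nan && return NaN
        r < p_nan + p_inf && return Inf
        return f(x) * (1 + noise * randn())   # multiplicative noise
    end
end

# A nightly job (e.g. on GitHub Actions) could call this with the solver under test.
# The seed is derived from the date, so each day explores different random cases
# while any failure stays reproducible from the logged seed.
function nightly_tough_tests(solve; nprob = 100, n = 100)
    Random.seed!(Dates.value(Dates.today()))
    for _ in 1:nprob
        c = randn(n)
        f = x -> sum(abs2, x .- c)            # a random (convex) test problem
        result = solve(tough(f), randn(n))    # the solver must at least not crash
        @assert result !== nothing
    end
end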

Comparing this with “testing a few problems and observing whether the result is expected”, I hope it is clear what I meant by “it is a joke” (sorry again for saying so). Recalling that these solvers may serve projects on which human lives depend, I guess it is clear why such a joke is not enough.

My experiences in past years have taught me three things.

  1. I do not know what will happen in a particular case until I have run sufficiently many tests on it.

  2. When I believe a test is stupid and unnecessary, the test will show me later that I am the stupid one.

  3. When I believe that I know numerical computation and my code well enough, some tests will show me that I don’t.

PRIMA has been tested in this way for more than 20 years, summing up the testing time of all the parallel tests. I insist that any porting/translation of PRIMA should go through the same level of testing. Otherwise, we cannot be sure whether it is done properly.

I put testing and verification at the very center of development (indeed, I feel that many — if not most — libraries have not been sufficiently tested). Today (20240108), I received the following comment:

thank you for modernizing Powell’s solvers and taking verification serious. This is such important work!

I am delighted that my efforts in testing are appreciated. (You should check the cartoon).

How do you test the libraries that you develop?


[This is a copy of What kind of tests are sufficient for the porting or translation of PRIMA? · libprima · Discussion #39 · GitHub with slight adaptations.]

4 Likes

Nice writing, and in general I agree.
Some remarks came to mind while reading; maybe you would like to elaborate on them.

Your primary quality criterion for testing is the accumulated time spent running the different types of tests: at least 10 years, better 20 years as you have. My feeling is that this is not a good measure at all, because the time needed for testing depends entirely on the quality of the test routines and on the input space accepted by the library/function under test. If the input space is huge and multi-dimensional, even 20 years of testing time may prove insufficient. Vice versa, a well-defined small input space can be tested exhaustively in a few minutes, even for a very complex calculation.

it suffices to test a few problems and observe whether the results are expected

Of course, this is never sufficient, and I doubt that many people believe it is. It’s more a lack of knowledge about testing and perhaps the lack of a good criterion for good testing habits.

In your case, with PRIMA as the example, I would be interested in some kind of estimate of the development costs, especially the ratio between the cost of implementing the algorithms and, on the other side, the cost of implementing and running the tests regularly. An estimate of such a ratio could perhaps be a better criterion for software testing quality. Beyond this general thought, I would be interested in your concrete numbers for PRIMA as an example, if you are able to provide such estimates.

We are a certified (ISO 17025) laboratory dealing with the health data of cancer patients. Some software I write has to comply with our certified development process, which comprises validation of the functionality. Without going into details, this is quite a problem, because the nature of our work is more scientific than a routine of well-established methods. The software consists of analysis pipelines that reimplement the manufacturer’s GUI software, which could not be used as an automatic pipeline. Our tests and validations of the software therefore have two main foci: first, it must resemble the results of the manufacturer’s software (>99% similarity = validation), and second, equal input must produce equal output even if software or computers change over the years (except when the change of output is planned, which triggers a new validation). These criteria are tested automatically at every stage and continuously over time. If any deviation is detected, data delivery processes are stopped immediately and the analysis of the failure starts. We have been doing this since (about) 2010 (certified since 2014).

In the above scenario I can’t estimate the absolute runtime of the tests, but I can say that every analysis run is accompanied by roughly three times as much test running. So, if the pipelines had been running non-stop for 10 years, we would have 30 years of tests (very roughly).

The development time for the analysis pipelines versus that for implementing the continuous tests is about equal if we ignore the time spent developing the certified development process. If we include that in the test development, the ratio shifts heavily toward the tests; I can’t estimate it well enough to give numbers, as many people were involved and I can’t say how much of their work went into the software part. The costs explode even more, as moving under an ISO certification is really expensive.

With our group having to deal with real patient data regularly, the criterion that people may die because of a software failure suddenly became relevant. This made the ISO certification a requirement, and it also made a difference in that it was no longer enough for me to know that the analysis pipelines work correctly; I have to show proof of correctness. Before, it was enough that the data (and scientific results) passed the scientific peer review process.

Back to the general topic of testing effort: I think it all depends on the risk the software creates. A general library need not be tested as if people die from its failures, but software which uses this library and puts people’s lives at risk should be tested like this, and it should test all its libraries as well. We do this by continuously testing that the same input always produces identical output (a minimal sketch of such a check follows below).
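
For illustration only, a reproducibility check of this kind can be sketched in a few lines of Julia; run_pipeline, the input file, and the stored reference digest are hypothetical placeholders, not our actual system.

using SHA   # Julia standard library for cryptographic hashes

# Re-run a frozen analysis and compare its output fingerprint against the digest
# recorded when the pipeline was validated. Any deviation halts data delivery.
function check_reproducibility(run_pipeline, input_file, reference_digest)
    output = run_pipeline(input_file)
    digest = bytes2hex(sha256(repr(output)))
    digest == reference_digest ||
        error("output deviates from the validated reference; stopping data delivery")
    return digest
end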

2 Likes

Yes, testing really drives home the limitations of your imagination. It fosters a violent expansion of the mind when your test fails, like a psychedelic journey. We learn this truth: All expectations at write time are illusion. All expectations at compile time are illusion. The tests themselves are only a mirage; they too must be tested. All is maya, flickering chaos, only briefly working on the current build.

2 Likes

It is unclear how results from an optimization library specifically would be a matter of life and death. Hopefully the users realize that even if the implementations of various methods are 100% certified bug-free, there is no global method that guarantees the optimum for all optimization problems, even if some mild conditions are assumed.

Testing is an investment, the optimum depends on a lot of factors. Some libraries need a lot of tests, some fewer. It all depends on the resources and the costs.

I am curious how you are doing that with a BSD license.

5 Likes

Great insights!
I’ve recently run into a bug in my code where I assumed that if x < y then sqrt(x) < sqrt(y). It turned out that the chances of encountering a counterexample in practice were non-negligible.

2 Likes

Could you share the counterexample? It is mind-boggling.

Found a counter-example on first try:

julia> x = floatmin(Float64)
2.2250738585072014e-308

julia> y = nextfloat(x)
2.225073858507202e-308

julia> sqrt(x) < sqrt(y)
false

6 Likes

julia> 399.99999999999994 < 400.0
true

julia> sqrt(399.99999999999994) < sqrt(400.0)
false

julia> sqrt(399.99999999999994)
20.0

Note that 399.99999999999994 == prevfloat(400.0).
The reason is that \sqrt{1 - \varepsilon} \approx 1 - \varepsilon / 2, so sqrt(prevfloat(x)) may differ from sqrt(x) by less than the machine epsilon, i.e. it is the same as sqrt(x) from the floating-point point of view.

8 Likes

Thanks to both of you, a very good lesson about floating-point dangers.

4 Likes

Very well said! I hold the same opinion but cannot express it in such a vivid way :slight_smile:

Here is a video of a TOUGH test: https://x.com/historyinmemes/status/1749511750020403222?s=20 :upside_down_face:

I learned similar things when working on PRIMA, which is coded in modern Fortran in the current version.

  1. Invoked with aggressive optimization flags, compilers may not guarantee the commutativity of floating-point addition / multiplication, even though IEEE 754 ensures that.

  2. 1 / Inf may be evaluated to NaN.

  3. 1.0E+37 / 1.0E+38 may be evaluated to 0.

  4. If a or b is NaN, max(a, b) (the maximum between a and b) may be different from maxval([a, b]) (the maximum of the vector [a, b]).

These behaviors are compiler-specific, and, surprisingly enough, they are not considered bugs …
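
These observations can be turned into automated sanity checks. Below is a rough sketch in Julia rather than Fortran (the behaviors above were observed with Fortran compilers under aggressive optimization flags); under default IEEE semantics these checks are expected to pass, and they simply document the assumptions the code relies on.

using Test

@testset "floating-point assumptions" begin
    x, y = rand(), rand()
    @test x + y == y + x                       # item 1: commutativity of addition
    @test x * y == y * x                       # item 1: commutativity of multiplication
    @test 1.0 / Inf == 0.0                     # item 2: must not evaluate to NaN
    @test 1.0e+37 / 1.0e+38 != 0.0             # item 3: must not underflow to 0
    a, b = NaN, 1.0
    @test isequal(max(a, b), maximum([a, b]))  # item 4: max vs. maxval-style reduction
end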

1 Like

It is unclear how results from an optimization library specifically would be a matter of life and death.

For me, it is not difficult to imagine that the numerical solution to an optimization problem may be part of a life-or-death decision.

Hopefully the users realize that even if the implementations of various methods are 100% certified bug-free, there is no global method that guarantees the optimum for all optimization problems, even if some mild conditions are assumed.

Totally agree. Some human beings must check the solution before it is applied.

Testing is an investment, the optimum depends on a lot of factors. Some libraries need a lot of tests, some fewer. It all depends on the resources and the costs.

Of course. What I wanted to say was that “testing a few problems and observing whether the result is expected” is far from sufficient if any serious testing is intended.

I am curious how you are doing that with a BSD license.

I mean, any porting/translation that I am actively involved in. If the developer does not see things in a similar way, then I would not be involved in such a project in the first place.

Thanks.

1 Like

2 and 3 really should be bugs without fastmath (at least they would be in any sane language). Note that this type of issue is exactly why Julia doesn’t support global --fast-math flags any more. Local fast math is much safer and more predictable.
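
For illustration, local fast math in Julia looks like the following sketch (a made-up example, not from PRIMA or any of its ports): the relaxed semantics apply only inside the annotated expression, while the rest of the program keeps IEEE behavior.

# @fastmath lets the compiler reassociate and otherwise relax IEEE rules,
# but only within the annotated loop; code outside keeps strict semantics.
function sumsq(v)
    s = zero(eltype(v))
    @fastmath for x in v
        s += x * x
    end
    return s
end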

2 Likes

I think you misunderstand my point: in general, optimization problems do not have algorithms that guarantee a (global) optimum. Some special cases exist (eg convex problems), but generally no matter how excellent and bug-free an implementation is, for every general algorithm you can devise a problem where the global optimum is not found.

Because of this, one cannot “check” the solution to a generic problem in the absolute sense, just apply some heuristics (eg multiple solvers from random starting points) and hope for the best.
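
A minimal sketch of that multi-start heuristic, with solve and the problem dimension as made-up placeholders:

# Run a local solver from several random starting points and keep the best result.
# This does not certify global optimality; it only raises confidence heuristically.
function multistart(solve, f, n; nstarts = 20)
    best_x, best_fx = nothing, Inf
    for _ in 1:nstarts
        x = solve(f, randn(n))        # local solve from a random starting point
        fx = f(x)
        if fx < best_fx
            best_x, best_fx = x, fx
        end
    end
    return best_x, best_fx
end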

A corollary is that overtesting algorithms which are themselves imperfect is rarely feasible or desirable.

We all like to believe that our work matters, but perhaps this is taking it too far :wink:

1 Like

I think you misunderstand my point: in general, optimization problems do not have algorithms that guarantee a (global) optimum. Some special cases exist (eg convex problems), but generally no matter how excellent and bug-free an implementation is, for every general algorithm you can devise a problem where the global optimum is not found.

No, I didn’t. I know your point very well. Indeed, even for the algorithms that guarantee a global solution in theory, and even if such an algorithm is applied to a convex problem, you most likely only obtain an approximate solution within finite time — even this is usually based on the assumption that your code is doing real-number calculation, whereas it really performs only floating-point arithmetic. However, this is irrelevant to the topic discussed here.

We all like to believe that our work matters, but perhaps this is taking it too far :wink:

Thank you for pointing this trivial fact out, although it is again a topic unrelated to the discussion.

Thanks.