Here is a conundrum I have about unit testing for our brilliantly conceived Julia packages (and unit testing in general).
Let us assume that I have invented an amazing algorithm to compute the meaning of life. As a good Julian, I write a package MeaningOfLife.jl to implement my amazing algorithm. Let us further assume that both the algorithm and my Julia implementation thereof are minimally complex. As a good citizen, I begin to write some unit tests.
Here’s the rub: In order to verify that the call meaning_of_life(l::Life) produces correct results, I have to implement the algorithm in the test module. But by assumption, the algorithm is minimally complex, and my Julia implementation to be tested is a minimally complex implementation of it. Therefore, because the tests are also written in Julia, all I am really doing in the unit-test module is writing the same code over again, with at least as much complexity in the second implementation. If bugs follow a Poisson process, then my unit test is at least as likely to be wrong as the code it is testing!
Thus, we see that unit tests are apparently useless. This brings up a further paradox: It does not seem reasonable that unit tests are useless, given that they are so widely used. What am I missing?
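To make the circularity concrete, here is a minimal sketch of what such a test seems forced to look like. The names are stand-ins defined inline (the real package doesn't exist, of course), but the shape of the problem is the same: the expected value has to come from somewhere.

```julia
using Test

# Stand-ins for the example: a Life type, the package implementation, and the
# second implementation the test module is apparently forced to contain.
struct Life end
meaning_of_life(l::Life) = 42              # the package's minimally complex implementation
reference_meaning_of_life(l::Life) = 42    # the test's re-implementation: the same code again

@testset "meaning_of_life" begin
    l = Life()
    # The expected value can only come from a second, at-least-as-complex implementation:
    @test meaning_of_life(l) == reference_meaning_of_life(l)
end
```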
For most practical problems there are (a) test cases where the results are known to high accuracy (either from analytical solutions or from other implementations) and/or (b) slower “brute-force” calculation methods to compare against for small problems.
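As a sketch of (b), with a made-up pair of functions standing in for the real code and its slow but obviously correct counterpart:

```julia
using Test

# Hypothetical pair for illustration: an optimized routine under test, and a
# slow brute-force version that is easy to convince yourself is correct.
fast_sum_of_squares(xs) = sum(abs2, xs)
brute_sum_of_squares(xs) = mapreduce(x -> x * x, +, xs)

@testset "fast implementation matches brute force on small problems" begin
    for n in 1:20
        xs = randn(n)
        @test fast_sum_of_squares(xs) ≈ brute_sum_of_squares(xs)
    end
end
```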
For really huge problems like climate simulations, people validate by comparing with (a) experimental data, (b) other independent implementations of simulation codes (typically using a variety of algorithms), and (c) “micro” tests of individual simulation components. Google “model validation” and you will see lots of discussion.
To add to this, there are many different kinds of tests.
Unit tests traditionally refer to tests that exercise very small pieces of functionality. You work some example out by hand on paper, and then test that your functions give exactly the value you know they should, given that you calculated the solution by hand. If you have these for each (or most) of your functions, then you know that your functions do what you think they do.
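For instance (using Base functions purely for illustration):

```julia
using Test

@testset "values worked out by hand" begin
    @test factorial(5) == 120    # 5*4*3*2*1 = 120, checked on paper
    @test binomial(6, 2) == 15   # 6*5/2 = 15, checked on paper
end
```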
Yes, you need to really check your analytical results. But you can usually pick things which are “known facts”, and make them redundant. For example, when testing a function which calculates a derivative, you test it on many inputs for which you have already computed the result. Divide these into classes of tests which each exercise an individual part (e.g. the chain rule). If only a single test in a class fails, you know you screwed up; an entire class should fail if the function is wrong. This kind of redundancy helps.
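A sketch of that structure, assuming a hypothetical derivative(f, x) under test (a plain central difference stands in for it here):

```julia
using Test

# Stand-in for the function under test: a central finite difference.
derivative(f, x; h=1e-6) = (f(x + h) - f(x - h)) / (2h)

# One class of tests: basic derivatives computed ahead of time.
@testset "basic derivatives" begin
    @test derivative(x -> x^2, 3.0) ≈ 6.0 atol=1e-4
    @test derivative(sin, 0.0) ≈ 1.0 atol=1e-4
end

# A separate class exercising the chain rule; if only this set fails, the
# chain-rule handling is the likely culprit.
@testset "chain rule" begin
    @test derivative(x -> sin(x^2), 1.0) ≈ 2cos(1.0) atol=1e-4
    @test derivative(x -> exp(2x), 0.0) ≈ 2.0 atol=1e-4
end
```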
Then you have integration tests. These string a bunch of pieces together to make sure the “whole thing” works. Convergence tests against an analytical solution, testing against data to some required accuracy, etc. fall into this category. Again, you are trying to hit some analytic truth you know through other methods, and to be safe it should be redundant.
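For example, a convergence-style test might check that a numerical result approaches a known analytic value as the discretization is refined (the trapezoidal integrator below is just a stand-in):

```julia
using Test

# Stand-in: composite trapezoidal rule on n subintervals.
function trapz(f, a, b, n)
    h = (b - a) / n
    return h * (f(a) / 2 + sum(f(a + i * h) for i in 1:n-1) + f(b) / 2)
end

@testset "convergence toward an analytic value (∫ sin over [0, π] is 2)" begin
    errs = [abs(trapz(sin, 0.0, Float64(π), n) - 2.0) for n in (10, 100, 1000)]
    @test issorted(errs; rev=true)   # error shrinks as the grid is refined
    @test errs[end] < 1e-5
end
```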
Lastly, you have regression tests. If you can’t write a unit test on something, but you know it works through integration tests or other means (plotting), then you can test on some equation where you don’t know the analytical solution, pulling the reference value from the output of the currently working code. Yes, this assumes that it actually is working, but that is okay if you have an integration test that relies heavily on it. This test will tell you if the function ever changes the values it’s calculating, which is very good to know even if you aren’t 100% sure it’s calculating the right thing.
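A sketch of the pattern (the model function is a made-up stand-in; in a real suite the reference would be a hard-coded literal recorded back when the code was believed to be working):

```julia
using Test

# Stand-in for code with no closed-form answer to check against.
mystery_model(n) = sum(sin, 1:n) / n

# In a real test file this would be a pasted-in literal from a known-good run;
# it is computed here only so the sketch is self-contained.
const REFERENCE = mystery_model(100)

@testset "regression: results have not drifted" begin
    @test mystery_model(100) ≈ REFERENCE rtol=1e-12
end
```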
Making good tests is an art form. Making good tests that don’t take up years of CI time is for gods.
Agreed. But test cases with known results show only that the function works for those particular inputs. We can’t show that it works in general by testing specific cases.
For the sake of argument, let’s assume that computing the meaning of life, the universe, and everything qualifies as a “really huge” problem. So, I write unit tests for meaning_of_life(), meaning_of_the_universe(), and meaning_of_everything(), and an additional test that adds their results. Checking with experimental data, I find that my result should be 42, but (let’s say) it is actually 43.02657. I know that I messed up, but per my original argument, I don’t know where. The only recourse I have is to come up with a third implementation, probably on paper or in my head. And I have no assurance that it is right, either! And if one run takes 10 million years (or even, for smaller problems, a couple of days), that’s a lot of computer time.
I’m not saying that unit tests are useless – on an intuitive level, the ones I write “feel” useful to me. I am trying to figure out the conflict between the observation that they “feel” useful and what seems like a convincing argument that they aren’t.
That’s true, but the point of unit testing is not to give a formal proof of correctness. Constructing such proofs for every bit of code you write would be infeasible, or at the very least extremely unproductive. Instead, the goal, I think, is to find a good balance between productivity right now on the one hand, and confidence in the results produced by the code and future productivity on the other. I mention future productivity because 1) unit tests often catch regressions early that would otherwise result in very long debugging sessions, and 2) unit tests give you the confidence to make rather drastic changes that can get you out of a local minimum of usefulness or performance.
To your specific example, even if the unit test is more likely to be wrong than the code, it can be useful, because it may fail in different cases or fail in different ways. Any discrepancies can be investigated by a competent human, and hopefully either the test or the code is fixed as appropriate.
I use unit tests as a guardian when refactoring code. Usually, when developing a function, I first write a simple version and add tests for its basic functionality and examples. Then, when rewriting it as a fancier version, it is reassuring to see the old tests pass.
I agree, I often see tests as a way to keep results consistent (not necessarily correct) during development, and to ensure fixed bugs won’t be brought back by future changes. Having an extensive test suite really gives you a lot of freedom (from worries) during development.
This isn’t a unit test, it’s an integration test. The fix for this problem would actually be unit tests, which would test each little meaning_of_life() etc. function individually, and not their sum. The more of these little units you have well-tested, the easier it is to identify the source of any issue.
@ChrisRackauckas, you are right. This is indeed an integration test. But the issue I was driving at holds, even with a combination of unit and integration tests.
Skimming through the answers above, the general theme seems to be: Tests are useful not because they ensure code correctness, but because they are an automated way to check the consistency of our work, both between the package and the test, and between versions of the package. This makes sense.
Of course, if we are competent at what we do (which many of us are), consistency is evidence for correctness.
A more pragmatic reason for unit tests is that Julia is JIT-compiled. Until a function is called, many of the errors that would cause an immediate failure (typos, etc.) will not be diagnosed. Whether initial testing to eliminate such things counts as “unit tests” or not is rather subjective. So-called test-driven development stresses simple tests from the get-go.
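A tiny illustration: the typo below raises no complaint when the function is defined; it only surfaces when something calls the function, which is exactly what even a trivial unit test forces to happen.

```julia
using Test

half_length(v) = lenght(v) ÷ 2   # typo: should be `length`, but defining this is silent

# Only calling it exposes the problem, as an UndefVarError at run time:
@test_throws UndefVarError half_length([1, 2, 3])
```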
Yes. The fact that Julia is JIT-compiled means that unit tests can be critical.
One thing I don’t think anybody’s mentioned here is the importance of using coverage tools along with unit tests, to ensure that every code path has been executed at least once.
That helps catch the things that would normally be caught in a static language, although just having unit tests that give 100% coverage still doesn’t give any guarantees of correctness.
In my experience though, unit tests that try to hit as many edge conditions as possible, in combination with getting 100% (or as near as possible) coverage, are my best defense against bugs in the code.
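For a Julia package this can be as simple as running the test suite with coverage tracking turned on (the package name here is the hypothetical one from earlier in the thread):

```julia
using Pkg

# Runs the package's tests while recording line coverage; this writes *.cov files
# next to the source, which tools such as Coverage.jl can then summarize.
Pkg.test("MeaningOfLife"; coverage=true)
```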