Consistency testing when unit-testing is infeasible

Unit-testing is a great concept. Unfortunately, when you are dealing with complex scientific code, it is usually impossible to formulate the tests, since you don’t know exactly what you are going to obtain (with the exception of very rare edge cases). After all, if you could write down the output analytically, you would not need to resort to programming in the first place.

In the end, the scientific code does get tested, albeit in an indirect manner: you check that the results look reasonable and that the results have the correct asymptotics. Maybe you know the analytical answer in a couple of edge cases. However, in comparison with unit tests, it is more like integration testing: you need to run the system as a whole and examine the final results.

On the other hand, if you have a battle-proven implementation (which you thoroughly tested by examining the final results), it seems that you actually can cook up a sort of unit testing: when you change the implementations of the functions, you can test the new implementations against the old ones.

Finally, after this long introduction, let me tell you my question: What is the best way to organize this sort of consistency testing? Is it possible somehow to use two different versions of the package in one test file?

Maybe the more feasible option is to copy the battle-proven implementations into a separate module which you use only for testing. What do you think about it?


I use a number of unit and integrative tests in my C++ fluid-simulation package, channelflow.

  1. Unit tests on component functions. E.g. the 3d solver uses a 1d Helmholtz equation boundary-value solver, so I test the 1d solver on a variety of test problems.
  2. In those unit tests, test convergence towards known analytic solutions and appropriately small residuals in artificially generated test problems. E.g. for an artificial 1d Helmholtz problem u'' - \lambda u = f plus boundary conditions, generate some random BCs and smooth functions f(x), solve for u(x), and then verify that the norm of the residual is small, \|u'' - \lambda u - f\|/\|f\| < \epsilon_{tol}.
  3. For integrative testing, concoct some test case that can be specified from a small initial data set and that reaches a solution that can be easily characterized. E.g. my code computes 3d, nonlinear equilibrium and periodic-orbit solutions of Navier-Stokes. So from the solutions I’ve computed with a trusted version of the code, I choose a few that are smooth and small in terms of discretization, and from them devise relatively simple initial guesses that converge towards those solutions when fed to the solver. (By simple I mean small enough to be specified in a small text file.) The integrative testing then runs these initial guesses through the solver and checks the norm of the residual and a few global (norm-like) properties of the solutions to verify that the solver reached the solution I intended.
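Point 2 of the list above can be sketched in Julia. The toy finite-difference solver below is assumed for illustration only (it is not channelflow code): it solves u'' - λu = f with Dirichlet boundary conditions, and the tests check both a small residual on a random problem and convergence toward a known analytic solution.

```julia
using LinearAlgebra, Random, Test

# Hypothetical 1d Helmholtz solver: u'' - λu = f on [0,1] with u(0)=a, u(1)=b,
# discretized with second-order finite differences on n interior points.
function solve_helmholtz(f, λ, a, b, n)
    h = 1 / (n + 1)
    x = range(h, 1 - h; length=n)
    A = SymTridiagonal(fill(-2 / h^2 - λ, n), fill(1 / h^2, n - 1))
    rhs = f.(x)
    rhs[1] -= a / h^2       # fold boundary conditions into the right-hand side
    rhs[end] -= b / h^2
    u = A \ rhs
    return x, u, norm(A * u - rhs) / norm(rhs)   # relative discrete residual
end

# Small-residual check on an artificially generated problem, as in point 2
Random.seed!(0)
x, u, res = solve_helmholtz(x -> sin(3π * x) + x^2, 2.0, randn(), randn(), 200)
@test res < 1e-10

# Convergence toward a known analytic solution: u(x) = sin(πx) solves
# u'' - λu = -(π^2 + λ)·sin(πx) with zero boundary conditions.
errs = map((50, 100)) do n
    x, u, _ = solve_helmholtz(x -> -(π^2 + 2.0) * sin(π * x), 2.0, 0.0, 0.0, n)
    maximum(abs.(u .- sin.(π .* x)))
end
@test errs[2] < errs[1]   # error shrinks under grid refinement
@test errs[2] < 5e-3      # consistent with second-order accuracy at n = 100
```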

In my case, I do Monte Carlo simulations of dynamical correlation functions of lattice spin systems. If I follow your example, I could test the convergence of the Monte Carlo averaging as the number of the samples grows. However, the error in Monte Carlo typically scales with the number of samples N as 1/\sqrt{N}. As a result, to achieve the relative tolerance of \varepsilon, you need to average over \varepsilon^{-2} runs. So, getting a decent relative tolerance requires quite a few runs.
To minimize the influence of statistics, the reference data should be generated with a very good relative tolerance. Then, when one does the convergence testing, it is necessary to achieve a good relative tolerance as well. So, doing convergence testing looks like a bit of a hassle in my case.

On the other hand, if you think about it, my case is quite different from yours. The point of my research is an approximation scheme for the exact dynamics of a quantum system. I only solve an ODE with varying initial conditions. The novelty in my case is in how I define the right-hand-side function for this ODE. As a consequence, I can simply compare the different implementations of this right-hand-side function. I do not need to test the ODE integration algorithm itself.

To summarize, I would like to discern the following two scenarios of scientific code testing:

  • One needs to test the algorithm itself. For this, one needs to run the algorithm and do some sort of convergence testing. On the plus side: one can choose some very simple problems for testing.
  • One needs to test the generation of data structures which are plugged into well-established algorithms. In this scenario, it is useful to organize the sort of consistency testing I was talking about: comparing different implementations of generation of these data structures.
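The second scenario can be organized very simply. A minimal sketch, assuming a made-up right-hand side for a periodic 1d spin chain (the physics here is purely illustrative): a straightforward reference implementation is compared against a vectorized one on random inputs.

```julia
using Random, Test

# Two implementations of a hypothetical effective field h_i = J*(s_{i-1} + s_{i+1})
# on a periodic chain: a naive, obviously-correct loop…
rhs_reference(s, J) =
    [J * (s[mod1(i - 1, length(s))] + s[mod1(i + 1, length(s))]) for i in eachindex(s)]

# …and a vectorized implementation under test.
rhs_fast(s, J) = J .* (circshift(s, 1) .+ circshift(s, -1))

Random.seed!(1)
for _ in 1:100
    s = randn(64)
    @test rhs_fast(s, 1.3) ≈ rhs_reference(s, 1.3)
end
```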

you check that the results look reasonable

This is a dangerous approach. The errors you want to catch with unit testing are often the small errors that may still produce “reasonable” results in some scenarios. The large ones will mostly be obvious to you anyway.

However, in comparison with unit tests, it is more like integration testing: you need to run the system as a whole and examine the final results.

You will probably find it is possible, and likely desirable, to break up your problem into pieces that make some sense on their own and can be tested separately. There is often some code that is all math/formulation, and some that performs mostly technical operations to run the formulation. You can separate these, and then test simple cases using the formulations, and test the technical code with very simple formulations.

Integration testing is also good!

Maybe you know the analytical answer in a couple of edge cases

These edge cases are invaluable - once you have unit tests of your components - to ensure that there aren’t bugs in how you have joined them together.

Is it possible somehow to use two different versions of the package in one test file?

Use multiple dispatch. Organise your formulations using structs and types, and you can just drop in alternate (often simplified) versions of them, that you define in the test suite. This is how I do nearly all of my model testing.
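For example (all names below are made up, not from an actual package): define the formulation as a type, let the pipeline dispatch on it, and the test suite can drop in a simplified formulation whose answer is known in closed form.

```julia
using Test

abstract type Formulation end
struct FullModel <: Formulation end      # the production formulation
struct SimpleModel <: Formulation end    # simplified stand-in defined for tests

# The pipeline only talks to the formulation through this function…
energy(::FullModel, x)   = sum(xi^2 + 0.1 * xi^4 for xi in x)
energy(::SimpleModel, x) = sum(abs2, x)

# …so the same pipeline runs unchanged with either one.
run_pipeline(m::Formulation, x) = energy(m, x)   # stand-in for the real solver

# With SimpleModel the expected result is known exactly: 3^2 + 4^2 = 25.
@test run_pipeline(SimpleModel(), [3.0, 4.0]) == 25.0
```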


Generally you can test at least the components, and usually you can test the whole calculation using simplified models, invariants, or conservation laws, depending on the problem. Testing scientific calculations usually requires some thought and investment, but it is always worth it, especially if you are optimizing for speed, so that you can catch problems as they arise.

Generally, make your project a project with Project.toml, and then put the tests in test/.

I don’t think I understand the question. Generally you cannot load two different versions of a package easily.


If you have a version of the package (with a specific set of dependencies) whose results you trust, you can treat them as if they were results obtained from another source that you take as “truth” (different software, a textbook, published papers…): run that “good” version once, and hard-code those results in the tests.

Or, if writing them down in the source file is not practical, make a script that prints the output to a text file, run it with the “good” version, and save the file. Then modify runtests.jl to print the output to another, temporary file, and compare the contents of both.
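A minimal sketch of this file-based approach, where `compute_results` is a stand-in for the real calculation:

```julia
using Test

# Stand-in for the expensive scientific computation
compute_results() = round.(sin.(0.1:0.1:0.5); digits=12)

mktempdir() do dir
    ref = joinpath(dir, "reference_output.txt")
    # Step 1 (done once, with the trusted "good" version): save its output
    write(ref, repr(compute_results()))
    # Step 2 (in runtests.jl): regenerate with the version under test and compare
    @test repr(compute_results()) == read(ref, String)
end
```

In practice the reference file would be committed to the repository rather than regenerated, so the comparison really is against the old version's output.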


This might be helpful:


This is true.
I was lazy in GitHub - oxinabox/LayeredLayouts.jl: Layered Layout Algorithms for Directed Acyclic Graphs
and was like “I can eyeball a graph layout and see if it minimizes number of crossings”.
And I was absolutely wrong.
(In this case i was ignoring that i should have been pathing through dummy nodes.
This has given me caution about eyeballing things to see if they look reasonable.
Not enough to add proper tests to that package, which is just for fun. But caution.

I am planning on writing a blog post about testing scientific code some time in the next month.
A lot of what is in my current draft is basically what @John_Gibson said in Consistency testing when unit-testing is infeasible - #2 by John_Gibson

On your question of

Finally, after this long introduction, let me tell you my question: What is the best way to organize this sort of consistency testing? Is it possible somehow to use two different versions of the package in one test file?
Maybe the more feasible option is to copy the battle-proven implementations into a separate module which you use only for testing. What do you think about it?

If this is the way you want to go:
It is not possible to load two versions of the same package in one Julia process.
You could load them in a separate process and use remotecall to communicate with it.
But it would be kind of fiddly.
Still, not much more fiddly than if you had a reference implementation in C or Python or Mathematica that you were comparing against.

I think, though, that your option of putting it in a separate module is the better one.
(Or perhaps just a separate type.)
Ideally, you would have a reference implementation that is much simpler (so you have more confidence that it has no bugs) and more robust, at the cost of being slower (you shouldn’t need to care about speed; it is only for testing).
Further, it would ideally be based on a different algorithm altogether, to eliminate the possibility that a shared misunderstanding exists between the implementation and the reference.
This is what autodiff packages do when they test against finite differencing: the reference is much more robust (though less accurate), much simpler, and based on a different algorithm.
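In miniature, that finite-differencing pattern looks like this (the function and its hand-coded derivative are purely illustrative):

```julia
using Test

f(x)  = sin(x) * exp(x)
df(x) = (cos(x) + sin(x)) * exp(x)   # hand-coded derivative under test

# Reference by a different, much simpler algorithm: central finite differences
fd(f, x; h=1e-6) = (f(x + h) - f(x - h)) / (2h)

# Agreement at several points makes a shared misunderstanding unlikely
for x in (-1.0, 0.3, 2.0)
    @test df(x) ≈ fd(f, x) rtol=1e-6
end
```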

Having the reference still in your project means that if it turns out to be bugged, you can at least fix it.


This is called golden master testing, and is quite easy to implement: just test

@test f(some_input...) ≈ known_result  rtol=some_tolerance

(I don’t think it’s productive to try to call the old f(...) at runtime in the test. Just hard-code the old “good” value.) It’s not a good way to validate a new program, of course, but it is good insurance against regressions of some feature once it is validated somehow.

For new programs/features, in many cases there is a well-tested program somewhere that handles at least some special cases, and it is quite useful to bootstrap by testing against other software in this way (in addition to reproducing analytical solutions or analytical properties of the solutions). Again, you can just hard-code the results of the other trusted software in your test file.


I totally agree with you. I guess I was a bit lazy and did not want to compute Hamiltonians and other quantum operators manually. In my defense, I should say that I did not eyeball anything: I compared the results with known analytical edge cases and with experiments. In the case of experiments, I did not expect the results of the simulations to coincide completely with the experimental ones, and by “looks reasonable” I really meant “agrees reasonably with experiment”.

I agree that calling the old f(...) at runtime is really unproductive. I think I can write unit tests for very small spin lattices, and reinforce them with “golden master testing” for more realistically sized systems.

One technical point here: if, for example, you want to add a test that runs one or two steps of a simulation and compares them with verified results, a stable random number generator can be useful: GitHub - JuliaRandom/StableRNGs.jl: A Julia RNG with stable streams
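A quick sketch of why this matters (the `simulate_step` function is a made-up stand-in): StableRNGs guarantees the same stream for the same seed across Julia versions, so reference values baked into tests stay valid after upgrades.

```julia
using StableRNGs, Test   # StableRNGs.jl is the package linked above

# Stand-in for one Monte Carlo step of the simulation
simulate_step(rng) = sum(rand(rng, 100))

# Re-seeding reproduces the exact same stream, so results are bit-identical;
# with the default RNG, the stream may change between Julia releases.
@test simulate_step(StableRNG(123)) == simulate_step(StableRNG(123))
```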


Hi Greg - Julia supports your idea to have multiple sub-modules in a project, and that seems like a great approach to me. You could call it trunk-based development or call it regression testing, but it would clearly allow you to address your biggest risks.

May I ask what’s the hardest part of the question for you right now? You seem to know the mathematical domain and to know about Julia’s system. Is it a matter of deciding what kinds of tests to put where? Or what other people would recognize as good decisions?

And one random idea: variance reduction techniques can be a great help for regression testing Monte Carlo. It’s a little bit of a pain to set up something like common random numbers (CRN), but it’s wicked fast once it runs.
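A sketch of CRN for regression testing (the two estimators below are illustrative, not from any real package): the trusted and the refactored code are driven by the same random stream, so their difference reflects code changes rather than Monte Carlo noise, and no ε⁻² averaging is needed.

```julia
using Random, Test

# Trusted estimator of E[U^2] = 1/3: a plain accumulation loop
function mc_old(rng, n)
    acc = 0.0
    for _ in 1:n
        acc += rand(rng)^2
    end
    return acc / n
end

# Refactored estimator under test, consuming the random stream the same way
mc_new(rng, n) = sum(abs2(rand(rng)) for _ in 1:n) / n

# Common random numbers: identical seeds on both sides, so any discrepancy
# is a code change, not statistical fluctuation.
@test mc_old(MersenneTwister(7), 10_000) ≈ mc_new(MersenneTwister(7), 10_000)
```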

Golden master testing is clearly insufficient, but the simplicity of writing @test f(x) == 1.452348594 makes it worthwhile in a lot of contexts.

Unit tests will never cover a whole pipeline under all possible conditions. If @test full_pipeline("some_real_data_file") == 34.24123 fails, but all unit tests pass, that gives me a lot of information to identify where the problem can be.


I suppose it is the last one: to know what others consider a good decision. Also, I thought that testing scientific code is simply an interesting topic for discussion and for an exchange of opinions.


I think it is best to make as much of the code as possible generic and reusable, instead of specific to a project. This often makes testing easier and frequently results in better code organization, too.

Julia makes this particularly easy: packages are lightweight, and committing the Manifest.toml to the repository for the actual calculation/project helps with reproducing the exact environment. Also, from a practical perspective, CI for the code that was factored out to packages won’t be rerun each time something else changes.


Yes, it is interesting. On the one hand, you’re treating your code like the subject of an experiment, like in the book The Design and Analysis of Computer Experiments. On the other hand, it can be as flatly wrong as an if-then missing an else. This thread has mentioned several approaches to minimize risk.

John’s “consistency” checking comes in many forms. You could ensure Lipschitz continuity of a result near a critical point, monotonicity, or a smooth approach to limiting values. There is a consistency check called metamorphic testing that looks for expected symmetries among arguments and results: given f(x)=y, find some g,h such that f(g(x))=h(y).
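A concrete instance of such a metamorphic relation, using f = var as a simple example: shifting the data is absorbed (g = shift, h = identity), and scaling by a maps to scaling the variance by a².

```julia
using Random, Statistics, Test

Random.seed!(2)
x = randn(1000)

# Metamorphic relations f(g(x)) == h(f(x)) for f = var:
@test var(x .+ 5.0) ≈ var(x)         # g = shift by 5,  h = identity
@test var(3.0 .* x) ≈ 9.0 * var(x)   # g = scale by 3,  h = scale by 9
```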

The “golden master” is also one of many kinds of checks against other implementations. You could compare Julia to an R implementation, a stochastic version to its continuum counterpart, your program’s answer to a paper’s graph, an inverse problem to its forward counterpart. Any of these place this work within context.

Colleagues and I struggle with testing well. It seems to stem from a combination of missing tools, a lack of common practice, and the depth of knowledge required to be good at software, numerical analysis, and statistics at the same time. I appreciate that you raised this question and enjoy hearing people’s approaches.


Ironically enough, I first wrote my code in Python, but then decided to switch to Julia. When I made the switch, I tested the data structures generated in Julia against the data structures generated in Python. Now I am considering making my Julia package more user-friendly by adding proper testing and documentation. And, as it turns out, it is a bad idea to have your tests depend on a completely different language.

I did not think about having some data pre-generated for tests. I am grateful to @heliosdrm for pointing this out.

Oh, you had a Python version. You could totally keep that for testing if you want, and even call it with PyCall to do direct @test comparisons. Selecting which tests to run in which places is a little by-hand right now: the only way I’ve figured out is to pass args to the test command, which become ARGS in runtests.jl and can then be used for skipping tests. But it’s normal to run different sets of tests for CI versus acceptance testing versus local unit tests.
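The ARGS mechanism can look roughly like this in test/runtests.jl (the group names are made up), invoked with e.g. `Pkg.test("MyPkg"; test_args=["unit"])`:

```julia
using Test

# Default to all groups; otherwise run only those passed on the command line
const GROUPS = isempty(ARGS) ? ["unit", "golden"] : ARGS

if "unit" in GROUPS
    @testset "unit tests" begin
        @test 1 + 1 == 2
    end
end

if "golden" in GROUPS
    @testset "golden master" begin
        @test sin(1.0) ≈ 0.8414709848078965
    end
end
```

CI could then run all groups, while a local quick check passes only `"unit"`, skipping the slower Python-dependent comparisons.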

Colleagues sometimes throw R code at me with, “please make it go faster.” I put the R into the test/ directory and write Julia until it matches the output. Then I modify it. It’s that two-step dance to refactor & then add features, made so much easier by RCall or PyCall.

The question here is how to organize CI and codecov in the GitHub repo. To run the Python code, I need Python installed, together with numpy, scipy and maybe some other libraries. I have all of these on my computer, but can I use all this stuff on the testing server connected to the GitHub repo?

Just pre-generate the test outputs, unless you want to keep maintaining the Python version indefinitely and need to keep the two codebases in sync.