Best practices for testing numerical results that are unstable across package versions

My team is using OrdinaryDiffEq.jl and several other numerical packages, and we test a variety of solver results against expected values within some tolerance. On multiple occasions, some of these tests have begun to fail after updating the major or minor version of a dependency. Recently, I believe a patch release of some as-yet-unidentified dependency caused tests to fail without any changes on our part. So far, the new results are not too far from what we had previously expected. However, we are concerned about the longevity of these tests and accumulated numerical drift.

julia> @test isapprox(result, 90.5; atol = 0.4)
Test Failed at REPL[10]:1
  Expression: isapprox(result, 90.5; atol = 0.4)
   Evaluated: isapprox(90.91580618639445, 90.5; atol = 0.4)

What are the best practices for testing such results and avoiding failures due to solver updates and numerical drift? We anticipate the following two suggestions, but would appreciate additional insight:

  • Using a more appropriate solver that provides better stability for our use case
  • Using validation pipelines to improve our confidence when updating broken tests

What kinds of numerical errors are you worried about?

Small changes in roundoff errors are hard to avoid between versions, or even between CPUs with the same version. But that doesn’t seem to be the problem in your case. In your example, you are seeing a relative error of about 4.6e-3 (whereas your atol corresponds to an rtol ≈ 4.4e-3), which sounds way too big to be roundoff error unless you are doing something very ill-conditioned or numerically unstable (in which case: don’t do that).
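To spell it out, those figures come straight from the failing test output quoted in the question:

```julia
# Relative error of the observed result against the expected value,
# and the rtol that the test's atol of 0.4 implicitly corresponds to.
expected = 90.5
observed = 90.91580618639445   # from the failing @test output above

relerr = abs(observed - expected) / abs(expected)   # ≈ 4.6e-3
atol_as_rtol = 0.4 / abs(expected)                  # ≈ 4.4e-3
```

So the test tolerance and the actual error are both on the order of 1e-3 relative, far above roundoff for Float64.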

It might be truncation error, which would suggest you might be setting your ODE solver tolerance too high for such a test. What error tolerance are you specifying/expecting?
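To illustrate what truncation error looks like (this is a base-Julia sketch, not OrdinaryDiffEq): forward Euler on u' = -u with u(0) = 1, integrated to t = 1, where the exact answer is exp(-1). The error is controlled by the step size, and a test tolerance tighter than the truncation error will fail no matter how stable the arithmetic is.

```julia
# Minimal fixed-step forward Euler integrator for u' = f(u).
function euler(f, u0, t1, n)
    u, h = u0, t1 / n
    for _ in 1:n
        u += h * f(u)
    end
    return u
end

exact = exp(-1.0)
err_coarse = abs(euler(u -> -u, 1.0, 1.0, 100)  - exact)
err_fine   = abs(euler(u -> -u, 1.0, 1.0, 1000) - exact)
# Shrinking the step size 10x shrinks the error roughly 10x for this
# first-order method; adaptive solvers do the analogous thing with
# reltol/abstol, so the requested tolerance bounds your test accuracy.
```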


The default ODE solver settings use a relative per-step tolerance of 1e-3, which means each step gets about 3 digits of accuracy. Since error accumulates over steps, you should expect only around 2 digits of accuracy in the final solution.

This test seems to assume a relative accuracy of about 3 digits (atol = 0.4 on a value near 90), while the solve only delivers about 2. This suggests to me that the default tolerance was used while the test was written assuming a tighter tolerance.
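One way to keep such tests stable is to request a few more digits from the solver than the test asserts, leaving headroom for version-to-version drift. A hedged sketch (the `prob`/`Tsit5` names and the reference value are placeholders; OrdinaryDiffEq's `solve` does accept `abstol`/`reltol` keywords):

```julia
using Test

# Hypothetical: request much tighter tolerances than the test needs, e.g.
#   sol = solve(prob, Tsit5(); reltol = 1e-8, abstol = 1e-8)
# then assert with an rtol well above the requested solver accuracy but
# well below what counts as "wrong" for the application:
result    = 90.91580618639445   # placeholder for the solver output
reference = 90.9158             # placeholder tight-tolerance reference
@test isapprox(result, reference; rtol = 1e-5)
```

The gap between the solver tolerance (1e-8) and the test tolerance (1e-5) is what absorbs small changes across package versions.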