Best practices for testing numerical results that are unstable across package versions

My team is using OrdinaryDiffEq.jl and several other numerical packages, and we test a variety of solver results to be within some tolerance. On multiple occasions, some of these tests have begun to fail after updating the major or minor version of a dependency. Recently, I believe this happened due to the patch release of a currently unidentified dependency, causing our tests to fail without us changing anything. So far, the new results are not too far from what we had previously expected. However, we are concerned about the longevity of these tests and accumulated numerical drift.

julia> @test isapprox(result, 90.5; atol = 0.4)
Test Failed at REPL[10]:1
  Expression: isapprox(result, 90.5; atol = 0.4)
   Evaluated: isapprox(90.91580618639445, 90.5; atol = 0.4)

What are the best practices for testing such results and avoiding failures with solver updates and numerical drift? We anticipate the following two suggestions, but would appreciate additional insight:

  • Using a more appropriate solver that provides better stability for our use case
  • Using validation pipelines to improve our confidence when updating broken tests

What kinds of numerical errors are you worried about?

Small changes in roundoff errors are hard to avoid between versions, or even between CPUs with the same version. But that doesn’t seem to be the problem in your case. In your example, you are seeing a relative error of about 4.6e-3 (whereas your atol corresponds to rtol ≈ 4.4e-3), which sounds way too big to be roundoff error unless you are doing something very ill-conditioned or numerically unstable (in which case: don’t do that).

It might be truncation error, which would suggest you might be setting your ODE solver tolerance too high for such a test. What error tolerance are you specifying/expecting?


The default ODE solver settings use a relative per-step tolerance of 1e-3, which means each step gets about 3 digits of accuracy. As errors accumulate, you can expect a total of around 2 digits of accuracy for the final solution.

This test seems to assume a relative accuracy of about 3 digits, while the solve is only accurate to about 2 digits. This suggests to me that the default tolerance was used while the test was written assuming a tighter (lower) tolerance.


It is hard to be specific without an actual example, but I would assume that either

  1. your tolerances are too tight, or

  2. there is a (subtle) bug in your code that is causing a minor error (could be the wrong numerical approach as suggested by @stevengj).

If you can come up with a reasonable expected tolerance from first principles, then just use that. If that fails, either the algorithm is not as accurate as you were assuming, or there is a bug. It is then very tempting to just increase the tolerance and be done with it, but that may just disguise the bug. [ask me how I know :wink:]

If you cannot come up with a tolerance from math alone, you can use various approximations. E.g., perform the calculation using BigFloats, and do it multiple times, perturbing the inputs with some tiny random relative adjustments (I usually use 1e-12 for Float64). The range of outputs will give you a rough idea of the expected tolerances.
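
A minimal sketch of this perturbation experiment in base Julia (the `compute` function here is a hypothetical stand-in for your actual calculation):

```julia
using Random

# Hypothetical stand-in for the real computation under test.
compute(x) = sum(sqrt, x) / length(x)

x0 = collect(1.0:10.0)

# High-precision reference via BigFloat.
ref = compute(big.(x0))

# Perturb the inputs with ~1e-12 relative noise and record the output spread.
Random.seed!(1)
outputs = [compute(x0 .* (1 .+ 1e-12 .* randn(length(x0)))) for _ in 1:100]
spread = maximum(outputs) - minimum(outputs)

# `spread` gives a rough empirical scale for a test tolerance;
# e.g. compare against `ref` with atol on the order of 10 * spread.
```

Running this a few times with different seeds gives a feel for how sensitive the result is to input noise.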

(A more rigorous way to do this is using interval arithmetic, if the code is generic enough to support it, but for complex calculations the bounds may be too wide to be practically useful.)


You’re assuming it’s roundoff error. It seems more likely that it’s truncation error (due to the finite step size of the ODE solver), which has little or nothing to do with floating-point arithmetic. Changing the precision won’t help.

I.e., if you want to do a @test with a relative tolerance of 4e-3, you need to set the tolerance on your ODE solver (which controls the step size) to be much smaller, e.g. 1e-5. Otherwise any change in the ODE algorithm might lead to the truncation error changing by more than your test tolerance.
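
As a sketch of this advice (assuming OrdinaryDiffEq.jl is installed; the ODE here is a toy problem with a known exact solution, not one from the thread):

```julia
using OrdinaryDiffEq, Test

# Toy linear ODE: u' = -u, u(0) = 1, so u(1) = exp(-1) exactly.
f(u, p, t) = -u
prob = ODEProblem(f, 1.0, (0.0, 1.0))

# The solver tolerance (reltol = 1e-5) is much tighter than the test
# tolerance (rtol = 4e-3), so truncation error cannot eat the test margin.
sol = solve(prob, Tsit5(); reltol = 1e-5, abstol = 1e-8)

@test isapprox(sol[end], exp(-1); rtol = 4e-3)
```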

(I’ve noticed this in QuadGK.jl as well — so many people just use the default tolerance, apparently thinking that algorithms like this are like sin or A \ b and are only limited by roundoff errors that they can’t control.)


I am not assuming anything, as we have little to go on, lacking an actual MWE.

Though you make a valid point, truncation error would be a likely possibility if one is comparing to a closed-form result. In that case, I would try to get a rough estimate of the magnitude of the error by varying the stepsize or an equivalent parameter.

The point that I am trying to make is that it is usually possible to gauge the error magnitude with some numerical experiments.

We have the facts that (a) the errors are on the order of 0.5% which is huge for roundoff, (b) it’s the result of an ODE solver, where truncation errors usually dominate, and (c) Chris says that the default tolerance corresponds to truncation errors on the order of 1%.

You’re right that there is a certain amount of speculation due to the lack of a MWE, but focusing on roundoff errors and precision would not be my first resort here.


Thank you all for your thoughts!

@stevengj at a high level, we are trying to ensure that our tests do not diverge from previously validated values, which are currently quite difficult to re-validate. While we are aware of and working to improve our specific use of solvers, tolerances, and other methodology, the primary intent of this post is to ask about higher-level strategies for such testing. If the answer is simply “be more rigorous”, rather than there being some more fault-tolerant approach, that is still helpful!

@Tamas_Papp we can try these suggestions! It’s very possible we have one or more bugs, but are currently more concerned about the test failures aligning with updates in dependencies rather than our code.

The above example was from a similar case; here is the specific error that prompted this post. We’re pinning OrdinaryDiffEq.jl to v6.106.0 and are still seeing a larger-than-expected jump in results.

  • reltol = 1e-5
  • method = BS5()
  Expression: isapprox(result, 9821.50204844313, atol = 1)
   Evaluated: isapprox(9825.315767629512, 9821.50204844313; atol = 1)

My coworker suggested some additional runs of the same test, for comparison:

  • reltol = 1e-5
  • method = Tsit5()
  Expression: isapprox(result, 9821.50204844313, atol = 1)
   Evaluated: isapprox(9856.869973203444, 9821.50204844313; atol = 1)
  • reltol = 1e-8
  • method = BS5()
  Expression: isapprox(result, 9821.50204844313, atol = 1)
   Evaluated: isapprox(9855.907363766271, 9821.50204844313; atol = 1)
  • reltol = 1e-8
  • method = Tsit5()
  Expression: isapprox(result.statistics.wmean_power_turbines, 9821.50204844313, atol = 1)
   Evaluated: isapprox(9872.343287040627, 9821.50204844313; atol = 1)

Where does this number come from that you are comparing to?

If it’s from a previous version, why doesn’t it change when you lower the tolerance? (If you only lower the tolerance in the new version, obviously it’s not going to converge to a larger-tolerance result in a previous version.)

What kind of ODE do you have? If it’s chaotic, it will be hard to get a precise answer after a long integration time. Can you post a runnable example of something you are struggling to get consistent results with?


(That’s very reasonable, but requires you to have error bounds on the previous values to know how accurately they should be reproduced. How did you get the error bounds?)


Why? If you have the ODE coded up in Julia, OrdinaryDiffEq.jl makes it very easy to try some high precision solvers.

Only changing one tolerance might not be doing what you are expecting. To establish a baseline, take two separate methods and solve with abstol=1e-12, reltol=1e-12. I’d do Vern9(), abstol=1e-12, reltol=1e-12 and Vern8(), abstol=1e-12, reltol=1e-12. The difference between those is a relatively straightforward way to approximate how much it has converged.
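
A sketch of that baseline check (assuming OrdinaryDiffEq.jl; the right-hand side here is a hypothetical placeholder for your actual problem):

```julia
using OrdinaryDiffEq

# Placeholder ODE; substitute your actual right-hand side.
f(u, p, t) = -u
prob = ODEProblem(f, 1.0, (0.0, 10.0))

# Two independent high-order methods at tight tolerances.
u9 = solve(prob, Vern9(); abstol = 1e-12, reltol = 1e-12)[end]
u8 = solve(prob, Vern8(); abstol = 1e-12, reltol = 1e-12)[end]

# Their difference estimates how far the solution has converged;
# a sensible test tolerance should be comfortably larger than this.
@show abs(u9 - u8)
```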

If that’s only at 2 digits, I’d have to know more about your ODE. Are there things like undeclared discontinuities?

I was going to report a similar issue for my package TestParticle.jl that uses solvers like Tsit5() from OrdinaryDiffEq.jl. All the tests looked fine 2 days ago. However, when I tried to create a PR today, I found some surprising numerical comparison failures that are probably linked to the update of the ordinarydiffeq solvers:
feat: add batch_size for distributed solver · henry2004y/TestParticle.jl@4e93055

I also ran locally with the master branch to confirm that this is something that originated upstream. Currently I have no idea which upstream package it is.

At what tolerance are you setting the ODE solver? Solving without specifying a tolerance is usually not going to work for an ODE solver. How did you compute the reference value?


I can see right here in your tests:

        sol = solve(prob, Tsit5(); save_idxs = [1], callback, verbose = false)
        # There are numerical differences on x86 and ARM platforms!
        @test sol[1, end] ≈ 0.8539409515568538

you are doing exactly what is described in this thread. The ODE solver by default solves with reltol=1e-3, abstol=1e-6, which are local (not global) tolerances; this generally means you get 1-2 digits of accuracy, as described in the documentation.

So you told the computation to be accurate to only 1-2 digits, and you wrote a test without setting atol or rtol, so it defaults to sqrt(eps(T)), which is about 8 digits of accuracy. Of course that test is going to drift: your test is set to a tighter tolerance than the solve. Either increase the tolerance on the test or decrease the solver tolerance; as stated, that test isn’t meaningful.
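
To make the default concrete, here is a small base-Julia illustration (the numbers are made up for the example):

```julia
a = 0.8539409515568538      # value from an old solve (illustrative)
b = 0.8539462               # value from a new solve, differing in the 6th digit

# With no atol/rtol, isapprox defaults to rtol = sqrt(eps(Float64)) ≈ 1.5e-8,
# i.e. roughly 8 significant digits -- far tighter than a reltol = 1e-3 solve warrants.
@show isapprox(a, b)               # false at the default tolerance
@show isapprox(a, b; rtol = 1e-2)  # true with a tolerance matched to the solve
```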


Thanks for the clarification! I will adjust the tests by setting atol and rtol accordingly.

As a related question, does this mean that the results of ODE solvers like Tsit5() are not guaranteed to be the same across versions? This is the first time I have observed these numeric changes, even though the “wrong” tests have existed for more than two years.

It’s only guaranteed up to the tolerance. There are many things that can change it: using a different CPU can get you different amounts of SIMD, which changes the result slightly (your tests have even noticed this). It’s simply not possible to guarantee bitwise compatibility.

Part of this can even be due to floating point: (x + y) - y is not necessarily equal to x, which means that even a small change to the time-stepping logic can cause small changes, and ODE solutions amplify small changes exponentially. Adaptivity gives a sense of accuracy up to a tolerance that you choose, but beyond that it’s hard to force absolutely no changes.
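
The non-associativity is easy to demonstrate in base Julia:

```julia
x = 1.0e-16
y = 1.0

# x is below eps(1.0)/2, so it is absorbed when added to y.
@show (x + y) - y        # 0.0, not 1.0e-16
@show (x + y) - y == x   # false
```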

In fact, I used to have a talk on reproducibility where I would tell people to implement a version of the Lorenz equation and then ask the audience what solution they got at t=100. From the answer I could look at a table and say “so you used an Intel CPU on macOS with SciPy?”. Chaotic equations accumulate O(1) error, so it turns out the ODE solver’s solution is a fingerprint of your OS, CPU, etc., because differences in your system’s math library, SIMD optimizations, how the binary was compiled, the defaults of the solver, and so on all stack in funny ways.
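
The Lorenz experiment is easy to reproduce in base Julia (a minimal fixed-step RK4 sketch, not OrdinaryDiffEq.jl): a 1e-12 perturbation of the initial condition grows to order-one differences well before t = 40.

```julia
# Lorenz system with the classic parameters.
function lorenz(u)
    σ, ρ, β = 10.0, 28.0, 8 / 3
    x, y, z = u
    return [σ * (y - x), x * (ρ - z) - y, x * y - β * z]
end

# Naive fixed-step RK4 integrator.
function rk4(u, dt, nsteps)
    for _ in 1:nsteps
        k1 = lorenz(u)
        k2 = lorenz(u .+ (dt / 2) .* k1)
        k3 = lorenz(u .+ (dt / 2) .* k2)
        k4 = lorenz(u .+ dt .* k3)
        u = u .+ (dt / 6) .* (k1 .+ 2 .* k2 .+ 2 .* k3 .+ k4)
    end
    return u
end

u_a = rk4([1.0, 0.0, 0.0], 0.001, 40_000)          # integrate to t = 40
u_b = rk4([1.0 + 1e-12, 0.0, 0.0], 0.001, 40_000)  # tiny initial perturbation

# The trajectories decorrelate: the difference is bounded
# only by the size of the attractor, not by the perturbation.
divergence = maximum(abs.(u_a .- u_b))
```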


We are also seeing some drastic changes in the results from adaptive timestepping in Trixi.jl; see, e.g., Remove unused (?) tols in IMEX examples · trixi-framework/Trixi.jl@647ca50 · GitHub. We traced it back to this PR, where you can also see that tolerances had to be adjusted: Add larger qmax on first step (Sundials CVODE behavior) by ChrisRackauckas-Claude · Pull Request #3071 · SciML/OrdinaryDiffEq.jl · GitHub.


On a somewhat personal note, I find PRs like the one mentioned by Joshua very worrying.

The fact that a bot adjusts test tolerances, and that PRs from a bot with an un-ticked box “CI passes (pending SciMLBase.jl companion PR merge)” are actually merged, is potentially dangerous; see the issue I reported.


The reason why that tolerance was adjusted is pretty obvious though, right? Do you have any actual qualms with it there? Why do you believe the tolerance on that wouldn’t change? It’s pretty well-explained?

It never checks the box because it always opens the PR and then checks CI; the writing of the PR happens before CI runs. Is it really a huge issue that we don’t locally run the full CI?

The issue that you reported was very clearly an issue with the test that you wrote. You even wrote a comment right there in the test that you relied on a very big assumption:

# We supply a small initial timestep to be able to use a larger AMR interval (2 instead of 1) throughout the simulation.
# This pays off almost immediately as only the first couple timesteps use this timestep before it is ramped up.
dt0 = 1e-8

which was not true for most ODE solvers, right? Or was your comment not correct? And if you fix the test:

then the behavior is corrected.