Best practices for testing numerical results that are unstable across package versions

Yes, the reason why tolerances need to be changed is clear. But then why not put the change into a breaking release? If you need to adjust your own test suite because previously successful tests do not pass anymore, that’s a sign that this PR is breaking, isn’t it? If the change is properly released in a breaking release, there is no issue. But I’m not planning to argue on this much further. We’ll adjust our tests accordingly. Thanks for your response and help!

Where in the documentation did we guarantee bit accuracy between DP8 and dopri8? We don’t make that claim. In fact, it hasn’t been true since fastpow was introduced. We could make a better direct regression test there, but it would also require reverting fastpow and relying on the system pow implementation to match the Fortran code. So yes, we can improve that test, and I already started doing a few things to have that around, but why would this be considered breaking? We never guaranteed to any user that it was exactly dopri8’s time-stepping routine.

2 Likes

Also, for your case I just ran a complete convergence analysis (Use IController in implicit restart test for deterministic behavior by ChrisRackauckas-Claude · Pull Request #2839 · trixi-framework/Trixi.jl · GitHub) and the result is very clear:

| tol         | L2[rho] (old ODECore) | L2[rho] (new ODECore) | Steps (old) | Steps (new) |
|-------------|-----------------------|-----------------------|-------------|-------------|
| 1e-3        | 0.02228               | 0.02291               | 15          | 14          |
| 1e-4        | 0.02765               | 0.02747               | 54          | 49          |
| 1e-5        | 0.02926               | 0.02901               | 141         | 131         |
| 1e-6 (test) | 0.02875               | 0.02922               | 399         | 385         |
| 1e-7        | 0.02840               | 0.02838               | 940         | 910         |
| 1e-8        | 0.02838               | 0.02838               | 2204        | 2268        |

Your ODE solution had about 2 digits of accuracy (and it stayed within that between the two versions), but your test expected agreement within 1.5e-8. If your ODE solution is only correct to 1%, why are you testing for 0.0000001% changes and then complaining when that test fails?
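The pattern behind this argument can be sketched in a few lines. This is a hedged illustration, not the analysis above: `solve_at_tol` is a hypothetical stand-in for a real solver call (here a toy explicit Euler integrator for u' = -u), and the point is only to estimate the achieved accuracy by self-convergence before picking a regression-test tolerance.

```julia
# Toy problem: u' = -u, u(0) = 1 on [0, 1]; exact answer is exp(-1).
# `solve_at_tol` is a hypothetical stand-in for a real adaptive solver call.
function solve_at_tol(tol)
    n = max(10, ceil(Int, 1 / sqrt(tol)))   # hypothetical step-count heuristic
    h = 1.0 / n
    u = 1.0
    for _ in 1:n
        u += h * (-u)                        # explicit Euler step
    end
    return u
end

u_test = solve_at_tol(1e-6)
u_ref  = solve_at_tol(1e-8)                  # tighter run as self-convergence reference
accuracy_estimate = abs(u_test - u_ref)      # ≈ the achieved accuracy, not 1e-6

# Regression tolerance: a safe margin above the estimated accuracy,
# rather than a number close to machine precision.
test_atol = 10 * accuracy_estimate
@assert isapprox(u_test, u_ref; atol = test_atol)
```

Note that the achieved global error here is orders of magnitude larger than the requested tolerance, which is exactly why tying a regression test to a tolerance like 1.5e-8 is fragile.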

We’ve done the due diligence to ensure that this is all indeed correct. But, since OrdinaryDiffEq v7 is right around the corner, would it be easier on users if we disabled it by default until v7? If your tests fail with it, you should fix your tests, because there doesn’t seem to be anything wrong here… but we can also ease the transition if that helps folks.

First, thanks to everybody involved for reporting their issues and perspectives. As far as I see, there are two conflicting points of view, originating from different expectations of what is the guaranteed stable behavior that should not change unless a breaking release is made.

  1. Some people from the Trixi.jl developer group (and the Julia user who started this thread) seem to assume that stability guarantees are made in the following way: small changes around floating-point precision (per step) are of course fine, but bigger changes that can change the result visibly should be made in breaking releases.
  2. Chris has the point of view that everything below the tolerances `abstol` and `reltol` passed to `solve` can change in every release.

This explains the conflict that came up here and, e.g., in Different behaviour between `OrdinaryDiffEqCore 3.9` and `OrdinaryDiffEqCore 3.10` · Issue #3101 · SciML/OrdinaryDiffEq.jl · GitHub and Add larger qmax on first step (Sundials CVODE behavior) by ChrisRackauckas-Claude · Pull Request #3071 · SciML/OrdinaryDiffEq.jl · GitHub.

I can understand the first point of view, since Julia itself adds specific warnings where this assumption does not hold, e.g., for random numbers.
I can also understand Chris: improving the performance without having to orchestrate a breaking release of a big package ecosystem is nice.

I guess everybody has their arguments and we just need to go ahead. Since OrdinaryDiffEq.jl and its sub-packages are developed in a monorepo (for a good reason), it is often difficult to track down changes that happened recently. A small process change that could make it easier for people to check what has changed in OrdinaryDiffEq.jl would be to disable merge commits and use squash merging only. Then, every PR would be associated with a single commit on the master branch, making it easier to see what has happened recently.

2 Likes

I intend to reply tomorrow to some very helpful responses since my last response, but quickly want to interject that I do not hold that view. Instead, I am simply asking for the best practices for this issue and will happily accept and implement the result.

3 Likes

I find it weird that you are calling this a “point of view”, as it is the documented API.

Solvers solve within tolerances. That’s all they promise.

Nothing in sufficiently complex numerical floating-point calculations is fully reproducible, even with the exact same package versions in the manifests; that’s the wrong user expectation. “No visible changes in outputs” is an impossible expectation in practice.
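As a minimal illustration of testing within tolerances rather than for bitwise equality, here is a sketch using the 1e-6 row from the table earlier in the thread; the 5% margin is a hypothetical choice, which in a real test should be derived from the problem’s actual accuracy:

```julia
# Hypothetical regression test: assert agreement within a tolerance,
# not bitwise equality. Values taken from the convergence table above.
reference = 0.02875           # stored reference value (old ODECore, tol = 1e-6)
current   = 0.02922           # value produced by a newer solver version

@assert current != reference                           # bitwise comparison fails...
@assert isapprox(current, reference; rtol = 0.05)      # ...tolerance-based test passes
```

Here `isapprox` with an `rtol` matched to the solution’s ~2 digits of accuracy survives solver-internal changes, while an exact comparison breaks on every harmless perturbation.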

1 Like

This very much feels similar to issues with hashmap/dict iterator ordering:

If something is stable for a long time and “feels reproducible”, then people/tests will start depending on it.

A typical approach is to have optional randomization for tests. Sure, this is expensive; but it doesn’t need to be enabled at real runtime, not even in every regular CI run; it is basically “testing the tests”.

What would that ideally look like? In view of the tolerance discussions, I’d guess that a solver tasked with “compute to tolerance epsilon” could simply compute to tolerance epsilon/2 and then add noise of order epsilon/2, with a variety of noise models (“no noise”, “statistical random noise”, a variety of “biased noises”; potentially even “adversarial noise”, but that’s probably too annoying to implement).
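A rough sketch of that idea, with hypothetical names (`fuzzed_solve`, `toy_solver`) and only the simplest noise models — not an actual OrdinaryDiffEq.jl API:

```julia
using Random

# Hypothetical wrapper implementing "solve to eps/2, then perturb by eps/2":
# tests that secretly depend on exact output values fail loudly under
# randomization, while tests that respect the tolerance keep passing.
# `solver` is any function obeying solver(tol) -> value.
function fuzzed_solve(solver, eps; noise = :none, rng = Random.default_rng())
    u = solver(eps / 2)                              # solve tighter than requested
    noise === :none      && return u
    noise === :random    && return u + (eps / 2) * (2rand(rng) - 1)
    noise === :biased_up && return u + eps / 2       # deterministic worst-case bias
    error("unknown noise model $noise")
end

# Example with a trivial "solver" that ignores its tolerance:
toy_solver(tol) = 1.0
u = fuzzed_solve(toy_solver, 1e-6; noise = :random)
@assert abs(u - 1.0) <= 1e-6   # still within the requested tolerance
```

The design choice here is that the noise budget comes out of the requested tolerance, so the wrapper never violates the contract “solve within eps”.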

Well, binary reproducibility on the same machine is extremely useful for debugging, and it’s nice if this property holds without rr.

To give examples where binary reproducibility on the same machine fails: Data races that are not UB (e.g. because atomics are used); logical data races that are not UB (adaptive splitting of problem instances, depending on how many cores are idle); things that are affected by objectid / addresses (e.g. hashmap iterator ordering depends on the internal state of the OS kernel that hands out mappings).

Making the function non-smooth by adding white noise to the rhs might affect an ODE solver in unexpected ways, but you could add smooth (bandlimited) noise similar to `randnfun` in Chebfun, e.g., by taking a finite Fourier series with `randn` coefficients.
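A minimal sketch of such bandlimited noise, assuming a finite Fourier series on [0, 1] with `randn` coefficients; the `bandlimited_noise` helper name and the 1/sqrt(nmodes) normalization are my own choices, not Chebfun’s `randnfun` definition:

```julia
using Random

# Smooth (bandlimited) noise on [0, 1]: a trigonometric polynomial with
# random coefficients, so it is infinitely differentiable, unlike white noise.
function bandlimited_noise(nmodes; rng = Random.default_rng())
    a = randn(rng, nmodes)          # cosine coefficients
    b = randn(rng, nmodes)          # sine coefficients
    t -> sum(a[k] * cos(2π * k * t) + b[k] * sin(2π * k * t)
             for k in 1:nmodes) / sqrt(nmodes)
end

η = bandlimited_noise(8)

# Adding ε * η(t) to an ODE right-hand side perturbs it without destroying
# smoothness; `perturbed_rhs` is a hypothetical helper name.
perturbed_rhs(f, ε, η) = (u, t) -> f(u, t) + ε * η(t)
```

With `ε` of order the solver tolerance, this perturbation stays within what an adaptive solver is allowed to ignore, while still exposing tests that depend on exact trajectories.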

On the other hand, if the solver itself adds the noise, it could probably just add white noise if it does it at the right place, after it’s made adaptation decisions? But that would require the developers to add custom hooks into every single solver algorithm, and there are a lot of them.

That was generated from a past run which was validated.

I’m not quite sure which runs you’re referring to here, but I understand your point that lowering the tolerance won’t converge to a value generated at a higher tolerance.

It is chaotic!

Unfortunately not, which is why I’m asking for high-level strategies :slight_smile:

I do not believe we came up with error bounds in a systematic way, so this is definitely a great place for us to improve!

We have several layers of validation, not all of which are in Julia :slight_smile:

This is really helpful, and seems like a good way to approximate some reasonable tolerances to test on. Thank you!

We have a variable number of discontinuities that are handled with callbacks.

That could explain why you only have two digits.

2 Likes