Yes, in the end I settled on something like that, except that the RNG is not the global one, but one defined inside a function depending on some input parameters (whether a seed is provided, whether a reproducible run is desired).
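A minimal sketch of what I mean (the function name and default seed are hypothetical, not the package's actual code):

```julia
using Random

# Build the RNG from the input parameters instead of using the global one.
function build_rng(; seed::Union{Nothing,Int}=nothing, reproducible::Bool=true)
    if seed !== nothing
        return MersenneTwister(seed)   # user-provided seed
    elseif reproducible
        return MersenneTwister(321)    # fixed default seed for reproducible runs
    else
        return MersenneTwister()       # seeded from system entropy
    end
end

rng = build_rng(seed=1234)
x = rand(rng)   # draws come from the local RNG, not the global one
```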
Not really, I am comparing with the output of the same calculation in a series of controlled runs, with realistic input. So yes, I am probably doing the brittle option, but I would feel quite unsafe if the testing were done with toy problems with analytical solutions, because many issues can arise in corner cases of real problems where the actual “shapes” I am integrating are too complicated. For now I think I will stick with this option.
Perhaps you misunderstand: the issue is not how you obtain the “true” solution (analytical, or MC runs) you compare to, but how you establish the error bounds for CI.
Ah, yes. Well, for the moment the default precision required by `isapprox` seems to be completely safe for the sequential version with the stable random number generator.
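For reference, the default relative tolerance of `isapprox` (when no absolute tolerance is given) is `√eps` of the element type, about `1.5e-8` for `Float64`, so two runs driven by the same RNG stream pass it comfortably:

```julia
# isapprox with defaults: rtol = √eps(Float64) ≈ 1.5e-8, atol = 0
a = 1.0 + 1e-9
b = 1.0

a ≈ b                        # true: relative difference is below √eps
isapprox(a, b; rtol=1e-10)   # false: a stricter tolerance rejects it
```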
Testing the parallel version of the package is a separate issue, where those problems arise more seriously. I have yet to set up a safe testing routine for those runs (while my package has a parallel version which is working quite nicely, I do not know yet how to run parallel tests in CI; I just haven't had time to look into that yet).
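One common approach (a sketch, not your package's actual setup) is that no special CI support is needed: the test script itself can spawn local worker processes, and seeding each task's RNG keeps the parallel result deterministic:

```julia
using Distributed
addprocs(2)                  # most CI runners provide at least 2 cores
@everywhere using Random

# each task gets its own seeded RNG, so the parallel result is reproducible
run_once() = pmap(1:4) do i
    rng = MersenneTwister(1000 + i)
    sum(rand(rng) for _ in 1:10_000)
end

@assert run_once() == run_once()   # same seeds ⇒ identical results
```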
So you get `√eps` relative precision from a stochastic calculation? That looks suspicious — even for IID draws, you would need a very, very large sample.
If the random number sequence is exactly the same, why not?
I mean, this is just a more complicated example of this:
julia> import Random
julia> Random.seed!(1234); sum(rand() for i in 1:1000)  # some value
julia> Random.seed!(1234); sum(rand() for i in 1:1000)  # exactly the same value
Sure, but what are you really testing then? That the same calculation produces the same result? Or is it coded in two different ways, just using the same random stream?
Generally speaking yes, it is coded in different ways. The idea is to have a bunch of tests that ensure that whenever I introduce modifications in the package (to improve performance, for example, or to add new features), I do not break what was working before. Is that different from testing in any other context? Of course some modifications can break these tests because of the random number sequences, but many (and the most frequent ones) won't. So that reassures me that I have not introduced regressions whenever I fix a bug, add some feature, etc.
Edit: but I understand your point. This kind of test is not designed for a major algorithmic modification of the package, for sure, in which case the test should aim at comparing with an expected result at a reasonable precision. I do have some tests of this kind, but they are not part of the automatic test set, because they take too long to run for a safe precision threshold.
For example: if in the volume code above I decide to compute `cutoff2 = cutoff^2` and not take the square root of the distance at every iteration of the loop, that saves ~5% of the time and the results are identical. It is good to have quick tests that assure me that I have not done anything wrong when small changes like that are introduced, and that everything continues to work as expected.
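A hypothetical sketch of such a quick regression test (the function and the volume being computed are illustrative, not your actual code): with a fixed RNG, the optimized loop must reproduce the result of the same stream bit for bit:

```julia
using Random

# Monte Carlo volume of the unit sphere, using the squared-cutoff trick:
# comparing squared distances avoids a sqrt at every iteration.
function mc_volume(rng; n=10_000, cutoff=1.0)
    cutoff2 = cutoff^2
    hits = count(1:n) do _
        x, y, z = rand(rng), rand(rng), rand(rng)
        x^2 + y^2 + z^2 < cutoff2      # no sqrt needed
    end
    8 * hits / n                       # one octant sampled; true value ≈ 4π/3
end

# identical RNG stream ⇒ identical result, so exact equality is safe here
@assert mc_volume(MersenneTwister(42)) == mc_volume(MersenneTwister(42))
```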
But since we are there, is there any limitation for CI testing? Can I add a test that takes half an hour to run?
Of course if that is possible I could add tests for which the actual result is compared to the expected precision that a user would expect from the results.
Generally one would test some invariant that should hold given inputs and outputs.
Hardcoding input/output pairs is occasionally necessary though.
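An illustrative example of an invariant test for stochastic code: instead of hardcoding the output, check that the estimate lands within its own statistical error bound, which should hold for any seed:

```julia
using Random

# Monte Carlo estimate of π from points in the unit square
function estimate_pi(rng, n)
    hits = count(_ -> rand(rng)^2 + rand(rng)^2 < 1, 1:n)
    4 * hits / n
end

n = 100_000
est = estimate_pi(MersenneTwister(7), n)

# standard error of the estimator: 4 * √(p(1-p)/n) with p = π/4
se = 4 * sqrt((π/4) * (1 - π/4) / n)
@assert abs(est - π) < 5se   # generous 5σ bound, robust to the seed
```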
This depends on your CI setup — most frameworks allow you to set a longer timeout. Of course this will burn through any free tier very quickly.
For some of the economic models I am working on, a CI run takes 2–3 hours. But it is still great, because the actual estimation takes 1–2 weeks, so catching errors early is valuable. We ended up running CI on our own machine using GitLab.