Chaos compatibility testing?

Does anybody know if there is some package that does a kind of “chaos compat engine” for deep testing, i.e. testing random versions of dependencies (and their dependencies) that fit within a library’s [compat] bounds?

I’ve run into a couple of bugs because some dependency-of-a-dependency was forced to a specific version due to another library existing in a downstream user’s project.

It would be awesome if there was a way I could randomly download packages within [compat] bounds (recursively) – as part of testing. I suppose such a tool would have to reach deep into how Pkg computes this, and would basically randomly select within the bounds, rather than selecting the most recent version?

I don’t know of any such package currently existing, but it shouldn’t be too bad to do if you’re willing to use Pkg a bit and don’t expect full coverage right away. You don’t really need to mess with the deep internals of Pkg either - you just need to set fairly random [compat] bounds on the dependencies of your package, re-resolve & run your testsuite (of course setting the bounds back to what they were originally).
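Something like this rough sketch shows the shape of that loop (untested; the dependency names and candidate version lists are made up, and in practice you’d want to pull candidate versions from the registry):

```julia
# Rough sketch of "randomize [compat], re-resolve, run the test suite".
# Run from the root of the package under test.
using Pkg, TOML, Random

project_file = "Project.toml"
original = read(project_file, String)   # keep a copy so the bounds can be restored
project  = TOML.parsefile(project_file)

# Hypothetical: in-bounds versions you want to sample as pinned compat entries.
candidates = Dict(
    "SomeDep"    => ["1.2.0", "1.3.4", "1.5.0"],
    "AnotherDep" => ["0.8.1", "0.9.2"],
)

try
    for trial in 1:5
        compat = get!(project, "compat", Dict{String,Any}())
        for (dep, versions) in candidates
            compat[dep] = "=" * rand(versions)   # pin this trial to one sampled version
        end
        open(project_file, "w") do io
            TOML.print(io, project)
        end
        Pkg.activate(".")
        try
            Pkg.resolve()   # may fail if the sampled pins are unsatisfiable
            Pkg.test()
        catch err
            @warn "trial failed" trial err
        end
    end
finally
    write(project_file, original)   # restore the original [compat] bounds
end
```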

Another approach would be randomly adding packages to your testing environment whose dependencies have an overlap with your dependencies. I’d imagine that will lead to a lot of false positives though, if some package is incompatible for reasons out of your hands.
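A minimal sketch of that variant might look like this (the package names are placeholders, and MyPackage stands in for the package under test):

```julia
# Sketch: test the package together with one randomly chosen "overlapping"
# package, letting that package's [compat] drag shared dependencies around.
using Pkg, Random

# Hypothetical: packages known to share (indirect) dependencies with yours.
overlapping_pkgs = ["SomeOverlappingPkg", "AnotherOverlappingPkg"]

mktempdir() do dir
    Pkg.activate(dir)
    Pkg.develop(path = pwd())           # the package under test, from this checkout
    Pkg.add(rand(overlapping_pkgs))     # pull in one overlapping package at random
    Pkg.test("MyPackage")               # hypothetical name of the package under test
end
```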

It sounds like you’re asking for compatibility fuzz testing. I don’t know of a CI script for that, but it sounds reasonable. However, I’d suggest using a form of Downgrade CI to do this more directly rather than via random sampling; see the PSA: Add Downgrade CI to Better Check Version Compatibility thread.
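For context, the core trick behind such a downgrade run is roughly the following (a crude sketch of the idea, not the actual CI action; it assumes simple [compat] entries like "1.2" or "^1.2.3"):

```julia
# Sketch of the downgrade idea: rewrite every [compat] entry so the resolver is
# pushed towards the oldest versions the package claims to support.
using Pkg, TOML

project = TOML.parsefile("Project.toml")
compat  = get(project, "compat", Dict{String,Any}())
for (dep, spec) in collect(compat)
    dep == "julia" && continue
    lower = lstrip(first(split(spec, ",")), ['^', '~', ' '])  # crude lower-bound extraction
    compat[dep] = "~" * lower   # e.g. "1.2" becomes "~1.2", allowing only 1.2.x
end
open("Project.toml", "w") do io
    TOML.print(io, project)
end
Pkg.activate(".")
Pkg.resolve()   # now resolves near the lower bounds (or errors if they conflict)
Pkg.test()
```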

@MilesCranmer is describing an issue due to a dependency of a dependency being downgraded though, not just direct dependencies. I don’t think the downgrade CI check takes care of that :thinking:

No, that is not the case. If you can ensure that all possible downgrades are bug-free, then you simply don’t need to care whether a dependency-of-a-dependency was forced to a specific version, because every specific resolvable version works. The way to prove this is downgrade CI; in fact, the sole purpose of such a CI is to solve this very problem.

Downgrade CI only downgrades direct dependencies though, right? So the dependencies of those downgraded dependencies will still resolve to the highest possible version, not to all versions allowed by their respective [compat] entries. The only way those dependencies-of-dependencies resolve to lower versions is if some other package is also loaded whose [compat] entries force the overall resolution lower.

Since this is a potentially very large search space, fuzzing these versions seems entirely appropriate.

If all packages have done proper downgrade CI testing, then you have the guarantee that any resolvable state is a working set of packages. So then you basically have either no (known) bugs, or an error from the package manager stating that no bug-free resolution is known. Hence the PSA to get downgrade CI standardized, so that lower bounds in packages get fixed. Basically, the only reason any of this matters is that there are packages which don’t test their lower bounds and those lower bounds cause bugs, so if we solve the root of the problem (packages shouldn’t lie in their Project.toml) then the rest goes away.

Right, that’s just not a realistic assumption to make though. There can be bugs that are only present in some released version between the lowest and the highest possible bound. Those kinds of bugs cannot be found by checking only the lower bounds of packages individually, since each package in isolation may pass at its extremes while a combination of packages forces a resolution somewhere in the middle.

Basically, downgrade/lower-bound CI only protects you from accidentally relying on newer features than the oldest version you claim to depend on actually provides; it doesn’t protect you against bugs introduced in newer versions of a dependency that you’d be hit by if the resolver happens to pick such a version.

If you need certain bug fixes for your package to work though, you should bump the minimum version and have a test that fails without it.

The key aspect here is really a question of who should put the bound. If Package A has an issue with Package B allowing an earlier version of Package C, should Package A add a higher minimum version of Package C? What I am essentially saying is no: if Package B is known to be a problem by allowing this earlier version of Package C, we should fix the compat bounds in Package B so that no downstream package runs into the issue, rather than put a bound in Package A that merely makes the local universe around B seem okay.

Given that principle, I am suggesting that we design our testing and CI so that Package B’s tests fail with this problem with Package C. What you are suggesting is that Package A’s tests fail with this problem with Package C. Sure, fuzz testing is never a bad thing to add, but it ignores what the root of the problem actually is and simply opts for a (relatively nice) workaround.

Yes, I agree with that! All I’m saying is that lower-bound testing through downgrade CI won’t catch those cases and doesn’t help developers fix those compat bounds.

In addition to that, fuzz pinning (i.e., pinning a direct dependency to any version allowed by [compat]) also wouldn’t have found the issue @MilesCranmer is talking about, since there the issue originated with an indirect dependency. You really do need some recursion, or the installation of other unrelated packages that share indirect dependencies with your package, to find these kinds of issues, by forcing the existence of “versions in the middle” of those indirect dependencies.
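To make that concrete, here’s a tiny sketch of the pinning variant: force the shared indirect dependency onto an arbitrary in-bounds version in a throwaway test environment (PkgA, PkgC, and the version are placeholders):

```julia
# Sketch: force an *indirect* dependency onto a "version in the middle" while
# testing, by adding it explicitly to a throwaway test environment.
using Pkg

mktempdir() do dir
    Pkg.activate(dir)
    Pkg.develop(path = pwd())                  # PkgA, the package under test
    Pkg.add(name = "PkgC", version = "2.3.0")  # placeholder version of the shared indirect dep
    Pkg.test("PkgA")
end
```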

If only direct deps are downgraded in this testing, then we won’t have such a guarantee even if everyone adopts it. If PkgA depends on PkgB, which depends on PkgC, it’s possible that PkgB works with any PkgC version while PkgA doesn’t.

You can only get a failure there if PkgC is implicitly changing PkgA’s behavior, which can only bypass PkgB if there’s type piracy of some form. If it’s using public, documented, and tested behaviors of PkgB, then it would have to be caught by the tests by definition. Otherwise, there is effectively a direct dependency.

You can also get a failure if PkgA depends on a part of PkgC that PkgB doesn’t rely on heavily enough for its own tests to fail. You’re only guaranteed to get a failure if every part of a package is tested at every step of the dependency chain, which isn’t usually the case. Not to mention that this requires perfect coverage, which we don’t even have the ability to measure reliably right now.

That’s my point though. Either you’re talking about a case where PkgA is directly using a feature of PkgC not through PkgB, which is thus either piracy or should be an explicit dependency, or you’re talking about a case where PkgA is using a feature of PkgB that relies on PkgC and isn’t tested in PkgB. The latter case has a clear solution: give PkgB fuller test coverage and fix the break in PkgB, rather than trying to patch it in PkgA. In either case, the issue is either that an indirect dependency should be made direct, or that PkgB is the thing that should get the better tests and fixes, not PkgA.

Now, one thing that PkgB should be doing to improve its situation here is downstream testing, where PkgB runs PkgA’s tests to know if anything it’s doing affects PkgA in an adverse way. The common downstream CI is precisely for this, which is why it’s widely used (though it still needs more adoption itself).

Say PkgB returns a type from PkgC, and PkgA uses that return value afterwards. PkgC can make minor changes to the behavior of that type over time that it doesn’t consider breaking. But PkgA unknowingly relies on those specifics, and would only work with some PkgC versions.

I think it’s unambiguous that testing downgrades of direct + indirect deps is strictly better and leads to a more reliable ecosystem than testing only direct-dep downgrades. And the most direct way to do that would be to just flip the sign in the Pkg resolver (:

Oh I agree, and the discussion in PSA: Add Downgrade CI to Better Check Version Compatibility - #33 by Mason is about having the downgrade CI updated to do exactly that (there’s discussion on pkg-dev as well).

Not necessarily - you can also have this kind of dependency chain:

PkgA
 | PkgB -> PkgC
 | PkgD -> PkgC

where both PkgB & PkgD depend on a subset of the documented & tested functionality of PkgC. Now, PkgB and PkgD may have different compat bounds for PkgC, and for each of those ranges their respective testsuites work in their entirety. This even still works if PkgA is fully testing its upper & lower bounds.

However, if you now have such a dependency chain:

PkgF -> PkgG -> PkgC
PkgA
 | PkgB -> PkgC
 | PkgD -> PkgC

and PkgF has compat bounds that just so happen to force PkgC onto a version where the combination of the functionality of PkgB and PkgD is broken for the purposes of PkgA, there’s just no obvious place to put a test to smoke that out, short of forcing the dependency PkgC (which is indirect from PkgA’s point of view) onto a specific version. That’s the example @MilesCranmer was bringing up in the OP: a third-party package (PkgF) was loaded that resulted in PkgC (the shared indirect dependency) being broken for the purposes of his package (PkgA). All of that is still true if all of the involved packages test their respective lower bounds in isolation.

Short of having a perfect test suite for PkgA (let’s not kid ourselves, that doesn’t exist), you can only realistically find this by either randomly pinning PkgC when testing PkgA, or by loading some other package that forces this downgrade to occur.

But then you’d get a package resolver error instead of a runtime error, because you would already have been required to exclude any older versions that conflict (via downgrade CI) and to stop any new versions that cause a downstream regression (via downstream CI).

What you’re describing is exactly the case of Optim.jl adding an Adam optimizer: not essentially breaking in and of itself, but Optimization.jl then saw a break when used with Optimisers.jl because both export an Adam type, which then needs to be disambiguated at the usage site in DiffEqFlux.jl. In this case the good solution isn’t to retroactively fix bounds in DiffEqFlux.jl, but rather to have appropriate downstream testing so that the test runs of Optim.jl flag the downstream breakage.

In other words, downstream + downgrade CI catches at least 99% of these cases in the ecosystem, but some packages haven’t adopted the combination yet. There’s also something to be said for better interface testing too, but I’ll leave that for another time.

Or by forcing the most downgraded case, under the assumption that things are continually improving and thus the most downgraded case is a good proxy for the state with the least features and patches. That adds a (reasonable) assumption but turns the CI into something deterministic. Fuzz testing can still be a good thing to add, but I’m just saying that before adding stochasticity to CI we should at least have made sure to get widespread adoption of reasonably stronger deterministic improvements that catch the core 99% of cases. Going straight to fuzzing adds its own complications (being an option that’s decent for maintainers but bad for growing contributors).

Not if the problem is a bug that only got exposed because of that specific combination!

You’re assuming that every package has perfect coverage and a perfect test suite for all the functionality it can offer, but that’s IMO not a realistic baseline to start from.

Yes, that is definitely better than the status quo, but again, it would not necessarily have exposed the issue presented in the OP, especially if PkgB’s and PkgD’s lower bounds on PkgC are lower than PkgF’s lower bound on PkgC, and the bug that caused the breakage lives at PkgF’s lower bound. PkgA doesn’t depend on PkgF after all; PkgF only got loaded because some third party wanted to use both PkgA and PkgF in the same environment.

I’m not claiming that fuzzing is a panacea. I’m just saying that downgrade CI for PkgA would not have caught the issue that started this thread.

No, I’m explicitly not assuming that, which is the reason downstream CI is part of the picture. Downstream CI is essentially a fallback for missing test coverage or interface specification, to catch misalignment between core users and the test suite. That quick fix for coverage is why it’s part of the status quo. In the cases you’re describing, either PkgB or PkgD would have to see failures in their downstream CI, or else the combination would work. This at least makes sure that the latest versions always work together (unless they both happen to merge and tag within the same half hour). Now, it’s also possible that there is a past version where the combination didn’t work. Under the assumption of continual improvement, the downgrade CI would have a good chance of finding that (if the furthest-back version also has the problem, which it would in the Optimisers + Optim case, an instantiation of what you described), which would then cause a bump of minimum versions to exclude that case going forward.