[ANN] Supposition.jl

That’s a very good question, but the answer is a bit more involved.

One key aspect of Supposition.jl is the ability to shrink found counterexamples in a predictable manner. Vectors shrink in length, as well as by shrinking each element of that vector. Tuples (created through e.g. @composed tuple(Data.BitIntegers(), Data.Floats())) shrink by shrinking their elements. The key is that this shrinking process keeps the existing invariants of the various Possibility objects that generated those values in the first place intact. For example:

julia> using Supposition

julia> data = filter(iseven, Data.Integers{Int16}());

julia> vec = Data.Vectors(data; min_size=3, max_size=20);

julia> example(vec, 4)
4-element Vector{Vector{Int16}}:
 [-4326, -25920, -14638, 11838, 17878, -4306, -30360]
 [-2052, -4920, -9792]
 [-26346, -18678, -22036, -12032, -6264]
 [29074, -18096, -17534, 3048]

In this example, anything that vec produces is guaranteed to always be a Vector{Int16}, the elements of which are always even and the length of which is always in 3:20. The examples also shrink predictably; the shrunk vectors will always have at least 3 and at most 20 elements, while the shrunk elements will always be even. That is to say, everything involved here has a deterministic minimum shrink (up to choice of property, of course). For a property that always returns false, the same input should shrink to the exact same minimal counterexample (different inputs may shrink to different minimal counterexamples, but each given input always shrinks to the same one).
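To make this concrete: a property over vec that always fails should shrink to the smallest input still satisfying those invariants. (A sketch using the @check API; the failure report itself is elided here, and its exact shape may differ between versions.)

julia> @check function alwaysfails(v=vec)
           false
       end

Since every input is a counterexample, shrinking should bottom out at Int16[0, 0, 0]: the minimum length of 3, with each element being 0, the minimal even Int16.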

So if this were only about drawing some random object from a collection, indeed, the standard statistical tools would be enough! Unfortunately, that’s not sufficient here, because the way those tools interact with an RNG doesn’t easily allow controlled shrinking. The way Supposition.jl works is by recording the binary choices that led to a given input, and then removing or modifying entries in that sequence of choices to produce an example that requires fewer choices while still leading to the same error/failure.
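Conceptually, you can picture that recording like this (a deliberately simplified sketch of the idea, not Supposition.jl’s actual internals - ChoiceSequence and draw_integer! are made-up names):

using Random

# Every primitive draw gets recorded, so the generated input is a
# pure function of the recorded sequence of choices.
struct ChoiceSequence
    choices::Vector{UInt64}
end

function draw_integer!(cs::ChoiceSequence, hi::UInt64)
    c = rand(UInt64(0):hi)   # the actual random draw...
    push!(cs.choices, c)     # ...and its record
    return c
end

# Shrinking never touches the RNG: it deletes entries from `choices`
# (a shorter sequence means a structurally smaller input) or replaces
# them with smaller values, then replays the sequence and checks
# whether the property still fails.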

Now, if I just called rand(rng, <distribution>) when generating an example, all Supposition.jl would have to go on for shrinking would be the initial seed of that rng! By the very nature of a PRNG though, this is pretty much a worst-case scenario, because tiny changes in the seed have a huge influence on the resulting number stream the statistical package sees, which has a huge knock-on effect on what data is actually generated. There is no easy correlation between what ends up in your test case and what the initial seed for that PRNG was (otherwise, our RNG would be quite broken), so it’s extremely difficult to shrink elements produced through calls to rand in a predictable or useful way (Supposition.jl does manage to make this process deterministic, but it can’t really do much with that - the determinism is only sufficient to give replayability, not good shrinkability).
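That sensitivity is easy to observe; even two adjacent seeds produce completely unrelated streams, so “nudge the seed” is useless as a shrinking operation:

using Random

# Two nearly identical seeds yield streams with no discernible
# relationship - there is no small change to the seed that makes
# the generated data simpler in a controlled way.
rand(Xoshiro(1), Int8, 5)
rand(Xoshiro(2), Int8, 5)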

Admittedly, I don’t know whether the particular algorithm I use here is implemented in a standard statistical tool in the ecosystem, but the important part is that it’s Supposition.jl which controls the actual sampling step and records any choices required to do so, in order to control how the sampled value shrinks. This is why a special construction is required: to make the shrinking well-behaved and predictable.

Well, and on top of this - the construction in Supposition.jl has extremely nice big-O properties; it’s based on this explanation of sampling loaded dice by Keith Schwarz, with O(n) initialization time, O(1) generation time and O(n) memory usage (3n, to be precise). The original paper that explanation is based on is “A Linear Algorithm For Generating Random Numbers With a Given Distribution” by Michael Vose.
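For reference, here is a minimal sketch of the alias method described there (Vose’s variant; an illustration of the general algorithm, not Supposition.jl’s actual implementation):

# Alias method: O(n) setup, O(1) sampling, O(n) memory.
struct AliasTable
    prob::Vector{Float64}   # probability of keeping column i
    alias::Vector{Int}      # fallback column if we don’t keep i
end

function AliasTable(weights::Vector{Float64})
    n = length(weights)
    p = weights .* (n / sum(weights))   # scale so the average weight is 1
    prob  = zeros(n)
    alias = zeros(Int, n)
    small = [i for i in 1:n if p[i] <  1.0]
    large = [i for i in 1:n if p[i] >= 1.0]
    while !isempty(small) && !isempty(large)
        s, l = pop!(small), pop!(large)
        prob[s]  = p[s]
        alias[s] = l
        p[l] -= 1.0 - p[s]               # l donates the mass that fills s
        push!(p[l] < 1.0 ? small : large, l)
    end
    # Whatever is left over has weight exactly 1, up to rounding.
    for i in small; prob[i] = 1.0; end
    for i in large; prob[i] = 1.0; end
    return AliasTable(prob, alias)
end

# Sampling is one fair die roll plus one biased coin flip.
sample(t::AliasTable) = (i = rand(1:length(t.prob)); rand() < t.prob[i] ? i : t.alias[i])

sample(t) then returns index i with probability weights[i]/sum(weights), in constant time per draw.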

I hope that answers your question!
