[ANN] Supposition.jl

Supposition.jl

Hello everyone! I am very excited to announce Supposition.jl, a property based testing/fuzzing framework using choice sequences, inspired by the Python framework Hypothesis. It supports Julia version 1.8 and up. The package uses ScopedValues.jl internally, so older versions can’t be supported - sorry!

Please do check out the documentation to get a taste for how to use the package. In particular, the section about Aligning Behavior & Documentation of functions is bound to be insightful :wink:

On a high level, Supposition.jl has these capabilities:

  • Minimization of generated examples
    • Targeted exploration of the search space, to help prevent the fuzzer getting stuck
      • Barring reflection on actual code coverage information, this works as a form of coverage tracking (more & better things are planned though)
    • Shrinking of encountered errors to the minimal input producing that specific error
  • Powerful combination of existing generators into new ones
  • Fuzzing of stateful data structures & state machines, with operations on them
  • Deterministic replaying of previously seen failures

and these goals:

  • Performance - a test framework should not be the bottleneck of a testsuite; avoidable bad performance is considered a bug, and has its own label on the issue tracker
  • Ease of Use - bad UX is considered a bug, and has its own label on the issue tracker
  • High Quality Documentation - bad Docs are considered a bug, and have their own label on the issue tracker

You’ll be the judge of whether Supposition.jl achieves them!

For more specific questions, feature requests and design ideas, please don’t hesitate to share them on the Discussions tab of the repository! Feel free to browse there too; there’s already a bunch of things I’d like to add. Most of them already have a plan for how to implement them, so getting them done is mostly a question of time & budget.

Without further ado, here’s a short usage example:

julia> using Test, Supposition

julia> @testset "Examples" begin

           # Get a generator for `Int8`
           intgen = Data.Integers{Int8}()

           # Define a property `foo` and feed it `Int8` from that generator
           @check function foo(i=intgen)
               i isa Int
           end

           # Define & run another property, reusing the generator
           @check function bar(i=intgen)
               i isa Int8
           end

           # Define a property that can error
           @check function baba(i=intgen)
               i < -5 || error()
           end

           # Feed a new generator to an existing property
           @check bar(Data.Floats{Float16}())

           # Mark properties as broken
           @check broken=true function broke(b=Data.Booleans())
               b isa String
           end

           # ...and lots more, so check out the docs!
       end
┌ Error: Property doesn't hold!
│   Description = "foo"
│   Example = (i = -128,)
└ @ Supposition ~/Documents/projects/Supposition.jl/src/testset.jl:280
┌ Error: Property errored!
│   Description = "baba"
│   Example = (i = -5,)
│   exception =
│    
│    Stacktrace:
│     [1] error()
│       @ Base ./error.jl:44
│     [2] (::var"#baba#11")(i::Int8)
│       @ Main ./REPL[2]:18
└ @ Supposition ~/Documents/projects/Supposition.jl/src/testset.jl:275
┌ Error: Property doesn't hold!
│   Description = "bar"
│   Example = (Float16(0.0),)
└ @ Supposition ~/Documents/projects/Supposition.jl/src/testset.jl:280
Test Summary: | Pass  Fail  Error  Broken  Total  Time
Examples      |    1     2      1       1      5  2.5s
  foo         |          1                     1  1.0s
  bar         |    1                           1  0.0s
  baba        |                 1              1  0.2s
  bar         |          1                     1  0.0s
  broke       |                         1      1  0.0s
ERROR: Some tests did not pass: 1 passed, 2 failed, 1 errored, 1 broken.

FAQ

What is property based testing?

In a nutshell, property based testing (PBT) is a testing philosophy where, instead of testing your code with individual minimal inputs, you define various properties that you’d like your code to have (such as “serialize and deserialize should be inverses of each other”), generate random inputs to your functions, and ultimately test whether your code upholds those properties when presented with that random input.

This approach can be much more fruitful in ensuring the API surface of a project doesn’t drift & stays aligned with the documentation, because the random input ensures your tests don’t stay green by chance. At the same time, the random nature of fuzzing helps exercise more distinct code paths than would otherwise be reached, resulting in higher coverage. Quite often, unusual combinations of arguments that are overlooked in manual, example-based unit testing are found immediately by fuzzing.
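As a tiny illustrative sketch (not from any real testsuite, with arbitrary generator choices), the serialize/deserialize property mentioned above could look roughly like this with Supposition.jl and the Serialization stdlib:

using Supposition, Serialization

# Vectors of Int64, with default length bounds
intvec = Data.Vectors(Data.Integers{Int64}())

# Property: deserializing a serialized vector should give the original back
@check function serialization_roundtrip(v=intvec)
    io = IOBuffer()
    serialize(io, v)
    deserialize(seekstart(io)) == v
end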

Why a new package?

Some people may remember my last package in this direction, PropCheck.jl, and may rightly wonder why I’m writing a new one. The reasons are simple:

  • PropCheck.jl was (comparatively) slow. Supposition.jl is, in some circumstances, up to 1000x faster while using 200x less memory than PropCheck.jl. In other situations, PropCheck.jl fails completely due to an overreliance on deeply nested functions that make the compiler give up when trying to infer a result type. Supposition.jl still handles these cases easily; for more information, check out the comparison benchmarks in the documentation.
  • The architecture of PropCheck.jl was heavily inspired by Hedgehog, a property based testing framework for Haskell. While very powerful & type oriented, the heavy reliance on function-based abstractions makes this kind of architecture not a good fit for Julia. The inner workings are hard to follow, and I still (after more than 4 years of working on it) get lost in the inner details of PropCheck.jl. In contrast, I can still easily hold Supposition.jl in my head. In addition, the web of functions & forced lazy abstractions made debugging internals quite painful - which is not the case in Supposition.jl (I’ve successfully used Supposition.jl to debug itself, and a large portion of the testsuite is just dogfooding its own API).
  • PropCheck.jl wasn’t able to minimize errors, and retrofitting that capability would have meant rewriting the entire package anyway. Much the same is true for replaying failure cases and for steering shrinking with target!, both of which Supposition.jl supports without much difficulty at all (see the short sketch after this list).
  • PropCheck.jl has a very clear distinction between “generation” and “execution” of a property; you cannot mix the two, and cannot have new data produced as part of the “execution” step that subsequently also shrinks together with the data from the “generation” step. This is trivial for Supposition.jl, and is supported out of the box (adding that feature once the core of Supposition.jl was done was literally a 12 line diff).
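To give a rough idea of the target! point above, here’s a minimal sketch (the property is contrived purely for illustration; target! reports a score that the search then tries to optimize towards):

using Supposition

# Contrived property: nudge exploration towards large `x`, making it easier
# for the fuzzer to hit the (deliberately planted) failing region
@check function finds_large_values(x=Data.Integers{Int64}())
    target!(x)  # report `x` as the score to optimize towards
    x < typemax(Int64) - 10
end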

So while the core of PropCheck.jl is, for its purpose, very solid, it’s not easy to work on its internals at all and it can’t (with reasonable effort) support many of the features I ultimately wanted to support (as well as quite a number of additional ones that are not in this initial release of Supposition.jl yet :wink: ).

Can I use this for formal verification?

TL;DR:

No.

A bit longer:

As was the case with PropCheck.jl, Supposition.jl is not a tool for formal verification work, and I still firmly believe that we’re VERY far away from having that sort of capability in Julia. Most interfaces from Base or the ecosystem are very loosely defined, and few (if any) actually give hard guarantees over the datastructure & algorithms in question, making formal verification extremely difficult. Multiple dispatch only makes this more difficult, since anyone can just redefine a function to mean something entirely different on their particular object. Any formal guarantees must thus be made on a per-method basis, and there hasn’t been any progress on that front as far as I’m aware.

This is not to say that formal verification of Julia code will never be a thing - I just happen to think that before we as a community can run, we need to learn how to walk :slight_smile: Supposition.jl is an attempt to help with that, by providing a tool that assists with finding the actual API boundary of any given function/datastructure.

I’ve already said it in the announcement of PropCheck.jl, but I’ll repeat it here: if this package becomes obsolete due to tremendous advances in formal methods & verification in Julia, I’ll happily retire this package to the annals of history :slight_smile:

48 Likes

Looks cool! I must commend the documentation, which is quite extensive for a recently released package.

I see that @checks can be used from within a @testset. Does that mean one can use @checks alongside ordinary @tests in a common test suite?

If my understanding is correct, Supposition.jl is a successor of PropCheck.jl. Have you considered redirecting the users of PropCheck to this new package? Perhaps with a migration guide, too?

2 Likes

Thank you! I want people to get ready to use this as soon as possible, and good initial docs are, in my view, the best way to do so :slight_smile:

Yep! Under the hood, @check provides a Test.AbstractTestSet that records its own result with its parent testset. You can, for the most part, think of @check as a Supposition.jl-specific @testset. There is one caveat: you cannot use @test inside @check.
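A small sketch of what that mixing can look like (nothing special required; the @check simply reports into the surrounding @testset):

using Test, Supposition

@testset "mixed suite" begin
    # an ordinary example-based test
    @test 1 + 1 == 2

    # a property-based check, recorded in the same testset
    @check function addition_commutes(a=Data.Integers{Int8}(), b=Data.Integers{Int8}())
        a + b == b + a
    end
end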

I have thought about this quite a bit, but haven’t taken the steps yet. My current thinking/plan is to have PropCheck around in mostly maintenance mode, and direct new features to Supposition.jl. New bugs will still be fixed though.

I originally wanted to incorporate the approach taken by Hypothesis into PropCheck, but noticed that it would basically require rewriting the whole package from the ground up, so that’s what I did. In the end, I’ve decided to keep PropCheck as a “here’s how you could port Hedgehog/a Haskell project to Julia” example, but I wouldn’t use it for new packages. It’s just too slow and reliant on the compiler playing nice.

In terms of migration - @check should be strictly more powerful than PropCheck.check ever was, and is the closest equivalent. Composition of generators is similar too. I can add a short section in the docs of Supposition.jl, with a table listing equivalences if that helps?

2 Likes

PropCheck seems to have gained a lot of attention (77 stars on github), so I’d say it is a good idea to let the users know there is a better alternative/successor now—but I see you have done that just now in the PropCheck repo :slight_smile:

1 Like

Yes - I’ll add the migration table for the syntax differences between PropCheck.jl and Supposition.jl to the docs of Supposition.jl over the weekend :slight_smile: That’s a bit more involved than just adding a note to the README & closing issues to redirect them.

3 Likes

Thank you. I work with stochastic algorithms with random generators, so I have used Hypothesis in Python and, more recently, PropCheck in Julia. An improvement through a new package is something I appreciate a lot.

1 Like

Thank you, that’s very encouraging to hear! If you used PropCheck.jl before, you’ll certainly appreciate the speedup Supposition.jl can offer :slight_smile: With time, I hope to catch up to Hypothesis feature-wise too!

2 Likes

Alright, I’ve added a comparison table to the FAQ; let me know if something from PropCheck is missing in there!

3 Likes

I’ve just tagged & released version v0.2.0! :tada:

This release isn’t that big, but it includes a good bugfix/feature. In v0.1.0, errors could be shrunk, but the way it was done led to very poor results when more than one distinct error was encountered.

What does “distinct” mean here? Previously, this meant whether the two errors were ==, but for some error classes that store richer metadata, this would pretty much always return false due to == falling back to ===, since most error types don’t customize equality.

In v0.2.0, this has changed a bit - by default, only the types of encountered errors are compared. So even without specializing ==, shrinks are much better now. In case the error has some richer metadata and/or some special handling is required, there is now also Supposition.err_less (not exported), which implements an ordering between errors of the same type. Please see its docstring for more information!
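As a purely hypothetical sketch of what opting into that can look like for an error type carrying richer metadata (MyBoundsError is just a stand-in; please check the docstring for the exact contract):

# Hypothetical error type with richer metadata
struct MyBoundsError <: Exception
    offending_index::Int
end

# Tell Supposition.jl that an error with a smaller offending index is a
# "smaller" (i.e. better) shrink than one with a larger index
Supposition.err_less(a::MyBoundsError, b::MyBoundsError) = a.offending_index < b.offending_index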

Also, in case more than one distinct error is encountered, each error type & its unique location will only be reported once to a user; so no more getting spammed by "Encountered an error, but it was different from the previously seen one - Ignoring!" warnings :slight_smile:

2 Likes

Alright, it’s been about a week and since I’ve had a bunch of free time, have another release. This time, a more featureful one!

v0.3.0 brings a number of new features with it, as well as a very interesting new doc section covering how to use a test oracle when porting code to Julia! Be sure to check out the new doc section on event! to learn more about that.

Breaking Changes

  • Due to an update of the internal format DirectoryDB uses for storing counterexamples, previously found failures can lead to an error when trying to replay them. The workaround to this is to modify the property in question to enforce immediate failure (e.g. by inserting return false) after logging the generated inputs and manually inserting a unit test specifically for those inputs instead. This should be done with Supposition.jl v0.2.1. Afterwards, you can delete the test/SuppositionDB directory to have Supposition.jl record newly found failures, and update Supposition.jl. Be sure to add ]compat notices, since this change is not backwards compatible.

New Features

  • event!, allowing recording of interesting objects (and optionally an associated label) in a property, which may help with further debugging. See also the dedicated docs section for this, as well as the short sketch after this list.
  • WeightedNumbers, for drawing a number from 1:n with the probability of picking any given number biased by a given Vector{Float64} of n weights.
  • WeightedSample, for drawing from an n-element long collection biased by an n-element long Vector{Float64} of weights.
  • Both @composed and @check now allow anonymous functions to be used, making for easier creation of properties and composed Possibility without the need for coming up with a name. See the respective docstrings of @check and @composed for more information.
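Here’s a small sketch showing event! and WeightedNumbers together (constructor details may differ slightly; see the docstrings):

using Supposition

# Draw from 1:3, biased 10% / 10% / 80% by the given weights
biased = Data.WeightedNumbers([0.1, 0.1, 0.8])

@check function within_range(n=biased)
    event!("drawn number", n)  # recorded and shown alongside any failure
    n in 1:3
end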

Bug Fixes

  • @check now always shows the duration taken, even when invoked standalone at the top level and not just when in a @testset.
1 Like

why not use existing functionality in statistical packages for this purpose?

1 Like

That’s a very good question, but the answer is a bit more involved.

One key aspect of Supposition.jl is the ability to shrink found counterexamples in a predictable manner. Vectors shrink in length, as well as by shrinking each of their elements. Tuples (created through e.g. @composed tuple(Data.BitIntegers(), Data.Floats())) shrink by shrinking their elements. The key is that this shrinking process keeps the invariants of the Possibility objects that generated those values in the first place intact. For example:

julia> using Supposition

julia> data = filter(iseven, Data.Integers{Int16}());

julia> vec = Data.Vectors(data; min_size=3, max_size=20);

julia> example(vec, 4)
4-element Vector{Vector{Int16}}:
 [-4326, -25920, -14638, 11838, 17878, -4306, -30360]
 [-2052, -4920, -9792]
 [-26346, -18678, -22036, -12032, -6264]
 [29074, -18096, -17534, 3048]

In this example, anything vec produces is guaranteed to be a Vector{Int16} whose elements are always even and whose length is always in 3:20. The examples also shrink predictably: the shrunk vectors still have at least 3 and at most 20 elements, and the shrunk elements are still even. That is to say, everything involved here has a deterministic minimal shrink (up to choice of property, of course). For a property that always returns false, the same input always shrinks to the exact same minimal counterexample (different inputs may shrink to different minimal counterexamples, but each one always shrinks to the same result).

So if this were only about drawing some random object from a collection, indeed, the standard statistical tools would be enough! Unfortunately, that’s not sufficient: those tools don’t easily allow controlled shrinking, due to the way they interact with the RNG. The way Supposition.jl works is by recording the binary choices that led to a given input, and then removing or modifying that sequence of choices to produce an example that requires fewer such choices while still leading to the same error/failure.

Now, if I’d just call rand(rng, <distribution>) when generating an example, all Supposition.jl would have to go on for shrinking would be the initial seed of that RNG! By the very nature of a PRNG though, this is pretty much a worst case scenario: tiny changes in the seed have a huge influence on the resulting number stream the statistical package sees, which has a huge knock-on effect on what data is actually generated. There is no easy correlation between what ends up in your test case and what the initial seed of the PRNG was (otherwise, our RNG would be quite broken), so it’s extremely difficult to shrink elements produced through calls to rand in a predictable or good way (Supposition.jl does manage to make this process deterministic, but it can’t really do much with that - the determinism is only sufficient for replayability, not for good shrinkability).

Admittedly, I don’t know whether the particular algorithm I use here is implemented in a standard statistical tool in the ecosystem, but the important part is that it’s Supposition.jl which controls the actual sampling step and records any choices required to do that, in order to control how that sampled value shrinks. This is why a special construction is required, to make the shrinking well-behaved and predictable.

Well, and on top of this - the construction in Supposition.jl has extremely nice big-O properties; it’s based on this explanation of sampling loaded dice by Keith Schwarz, with O(n) initialization time, O(1) generation time and O(n) memory usage (3n, to be precise). The original paper that explanation is based on is “A Linear Algorithm For Generating Random Numbers With a Given Distribution”.
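For the curious, the core idea of that construction (the alias method) looks roughly like this - purely an illustrative sketch, not Supposition.jl’s actual implementation (which, among other things, routes its random choices through the recorded choice sequence instead of plain rand):

# Illustrative sketch of the alias method (Vose's variant):
# O(n) setup, O(1) per draw.
struct AliasTable
    prob::Vector{Float64}   # probability of keeping the column's own index
    alias::Vector{Int}      # index to fall back on otherwise
end

function AliasTable(weights::Vector{Float64})
    n = length(weights)
    scaled = weights .* (n / sum(weights))
    prob = zeros(n)
    alias = collect(1:n)
    small = Int[]
    large = Int[]
    for (i, p) in enumerate(scaled)
        push!(p < 1.0 ? small : large, i)
    end
    while !isempty(small) && !isempty(large)
        s = pop!(small)
        l = pop!(large)
        prob[s] = scaled[s]
        alias[s] = l
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        push!(scaled[l] < 1.0 ? small : large, l)
    end
    # whatever is left over gets probability 1 of keeping its own index
    for i in small
        prob[i] = 1.0
    end
    for i in large
        prob[i] = 1.0
    end
    AliasTable(prob, alias)
end

# Draw an index in 1:n, biased by the original weights
function draw(t::AliasTable)
    i = rand(1:length(t.prob))
    rand() < t.prob[i] ? i : t.alias[i]
end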

I hope that answers your question!

4 Likes

I see. I’m not familiar with the details, but that explanation passes the vibe check for me :slight_smile:

the construction in Supposition.jl has extremely nice big-O properties;

right, I saw you’re making an alias table. not that it’s important at this point, but just fyi StatsBase.jl does have one of these as well

1 Like

In a nutshell, it’s the difference between this:

julia> using Supposition

julia> vec = Data.Vectors(Data.Integers{UInt8}();min_size=5,max_size=5);

julia> @check function third_is_not_five(v=vec)
           v[3] != 5
       end;
┌ Error: Property doesn't hold!
│   Description = "third_is_not_five"
│   Example = (v = UInt8[0x00, 0x00, 0x05, 0x00, 0x00],)
└ @ Supposition ~/Documents/projects/Supposition.jl/src/testset.jl:292
Test Summary:     | Fail  Total  Time
third_is_not_five |    1      1  0.0s

giving [0x0, 0x0, 0x5, 0x0, 0x0] and this:

EDIT: This example is slightly buggy; see my next reply for why!

julia> @check function third_is_not_five(v=vec)
           v .= rand(UInt8, 5)
           event!("ACTUAL DATA", v)
           v[3] != 5
       end;
Events occured: 1
    ACTUAL DATA
        UInt8[0x07, 0xbd, 0x05, 0xf4, 0xee]
┌ Error: Property doesn't hold!
│   Description = "third_is_not_five"
│   Example = (v = UInt8[0x55, 0x63, 0x3b, 0xf0, 0x2b],)
└ @ Supposition ~/Documents/projects/Supposition.jl/src/testset.jl:292
Test Summary:     | Fail  Total  Time
third_is_not_five |    1      1  0.0s

giving some actually random vector UInt8[0x07, 0xbd, 0x05, 0xf4, 0xee]. I even had to add the event! there just to show it, since the actual data produced by the generator doesn’t matter anymore! The former is much nicer to work with when you’re trying to debug something - the latter is (as it should be!) uniformly random. That’s just not very helpful when trying to debug some bug, since nothing about that springs out as being obviously wrong, or being easy to keep in your head when you’re debugging :slight_smile: For more complicated properties, you can more or less give up all hope of shrinking anything at all.

Interesting, thank you for pointing that out! I’ll have to take a look at their implementation, maybe there’s some smart tricks they’re doing when building it that I could incorporate too.

An additional thank you for asking @adienes, since the example I showed above actually uncovered a bug; the correct “counterexample” that Supposition.jl should produce (and now does with v0.3.1, which should be available any minute now) is of course UInt8[0x00, 0x00, 0x00, 0x00, 0x00].

julia> using Supposition

julia> vec = Data.Vectors(Data.Integers{UInt8}();min_size=5,max_size=5);

julia> @check db=false function third_is_not_five(v=vec)
           v .= rand(UInt8, 5)
           event!("ACTUAL DATA", v)
           v[3] != 5
       end;
Events occured: 1
    ACTUAL DATA
        UInt8[0xaf, 0xf1, 0x05, 0x9d, 0x7b]
┌ Error: Property doesn't hold!
│   Description = "third_is_not_five"
│   Example = (v = UInt8[0x00, 0x00, 0x00, 0x00, 0x00],)
└ @ Supposition ~/Documents/projects/Supposition.jl/src/testset.jl:292
Test Summary:     | Fail  Total  Time
third_is_not_five |    1      1  0.3s

What’s happening is that the data Supposition.jl generates for the vector doesn’t matter, so it shrinks it to the minimal allowed example that can be produced; a vector of length exactly 5 (since that’s what I’m limiting it to in vec) that’s all 0x0. The contents of the vector are completely controlled by rand, but figuring out a seed that produces exactly UInt8[0x00, 0x00, 0x05, 0x00, 0x00] is near impossible. It might even be completely impossible, if no seed exists that produces that output on its first try.

The same is true if I were to use an existing statistical sampler - Supposition.jl would have to figure out the non-trivial non-linear relationship between the seed of the RNG, and the output of that sampler; a task that, depending on the PRNG, even Z3 can struggle with immensely (and that’s only for one call to rand - just imagine how bad this gets with multiple calls to rand, each modifying the default_rng. The knock-on effects are a nightmare). Supposition only manages to find an example because out of all 256 possibilities for UInt8, for the 10_000 samples it tries by default we’d expect to get 0x05 at index 3 about 39 times.

2 Likes

I’ve been waiting for this. PropCheck.jl helped me catch many issues with my personal projects. I might have time tonight to migrate my tests to this package.

2 Likes

I just have to say, the Alignment with Documentation page of the Supposition.jl documentation is excellent.

In the case where floating point addition not being associative means sincosd properties can’t be checked for exact values, we’d have to settle for a tolerance, right? Is that worth exemplifying on that same documentation page? Maybe a note in the sincosd docstring as to the tolerance that associativity was checked with?

1 Like

Are you sure it’ll ever be? Formal verification requires proofs. Sometimes it’s too complex to rule out every possible case where things could go wrong. In many applications, demonstrating beyond reasonable doubt is sufficient. Even when verification is possible, it might require users to write proofs, which can be really difficult. (Some conjectures take hundreds of years to prove informally, in human language, let alone formally. It’s really difficult to formally prove even simple things like the existence of infinitely many primes.) Maybe automated proof assistants can help, but usually with a huge number of limitations (for example, the Rust borrow checker can prove certain memory-safety properties, but with big limitations). I think this package would have its place even if formal verification were in a much better state.

The trigonometric functions as implemented in Base guarantee a certain level of precision relative to the “true” result; usually within 1 ULP (unit in last place), if I recall correctly. Nevertheless, even without that, the example with sincosd is mostly about ensuring sincosd and sind/cosd align exactly, since sincosd is supposed to be a faster drop-in replacement for calling the original functions separately when both results are needed.

If the example were about checking that sind itself is precise enough/within a tolerance, yep, that would have to be checked as well. That’s usually done by comparing against the same computation carried out in BigFloat, and checking that the results agree up to some chosen tolerance.
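Roughly like this, as a sketch (tolerances chosen arbitrarily here, just for illustration):

using Supposition

# Only finite inputs; NaN/Inf would need their own special-casing
finite_floats = filter(isfinite, Data.Floats{Float64}())

# Compare the Float64 result against the same computation carried out in BigFloat
@check function sind_close_to_bigfloat(x=finite_floats)
    isapprox(sind(x), Float64(sind(big(x))); rtol=sqrt(eps(Float64)), atol=eps(Float64))
end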

I’m not 100% sure of anything when it comes to code, hence the existence of this package :wink: Better test it than to be sorry. To be clear, the “if” I’ve written there is a very big “IF”; quite a lot of properties that humans would like to prove are indeed fiendishly difficult to prove. That being said, I can’t rule out that there won’t be tremendous advances either, so if they happen, I’ll gladly switch to those instead!

2 Likes

I would vote for “partial formal verification” :sweat_smile:

Perspectives differ depending on the area one works in, but IMHO the software industry generally needs some form of “partial formal verification” because software development requires a lot of discipline. The example of the Rust borrow checker is really good from a cultural perspective. It teaches developers (like myself) that there are (some) strict rules to follow; I think of those like the unforgiving laws of physics in fields like mechanical engineering or aerospace. It’s like signing a contract with the compiler to adhere to these rules. If you want to break them, you must explicitly take responsibility by using the “unsafe” keyword. This makes tracking potential issues easier (by searching for “unsafe” in the code for example).

This approach doesn’t cover all possible issues but it significantly reduces the risk of certain classes of bugs.

If “(partial)” formal verification were to take some credit, it should be for helping us understand the concept/abstraction more deeply and avoiding bugs “by design”. As John Hughes mentioned in the Volvo Project (section 3):

Volvo Cars wanted to test their suppliers code for conformance with the standard, and funded Quviq to develop QuickCheck models for this purpose.

For this project we read around 3,000 pages of PDFs (the standards documents). We formalized the specification in 20,000 lines of QuickCheck code. We used it to test a million lines of C code in total, from 6 different suppliers, finding more than 200 problems—of which well over 100 were ambiguities or inconsistencies in the standard itself! In the few cases where we could compare our test code to traditional test suites (in TTCN3), our code was an order of magnitude smaller—but tests more.

So, does the method scale? Yes, it scales

2 Likes