Automation to ensure green CI on master

CI on master is sometimes failing, which makes it harder for new contributors to make PRs because they don’t have a good signal about success/failure. Some projects like Rust ensure that the master branch of rust-lang/rust is always in a valid state.

Are devs opposed to such a system, has nobody got around to setting up the automation, or is there a budget problem?

The problem with this is that our test suite is slightly non-deterministic. Sometimes someone merges a PR that was green by chance but contains a bug that is detected in, say, 10% of CI runs. That would then block a lot of other PRs from merging while we identify the PR that broke things and fix it.


How (if at all) does Rust manage to evade spurious failures? Do we know what kinds of non-determinism in our test suite often lead to these failures?

Is there a good way to identify tests with nondeterministic outcomes?

example of commits to main branch that failed CI

From talking with some Rustaceans I think that’s not really in master. My link above shows Rust has a system of trying multiple commits at once to save time. In this case one of the commits in a group failed, but the final “rollup” merge commit for that group was successful: Auto merge of #114565 - matthiaskrgr:rollup-p7cjs3m, r=matthiaskrgr · rust-lang/rust@72c6b8d · GitHub

I think a good philosophy is described in this issue from a (non-Rust but very good) async library, namely putting heavy effort into tracking down and eliminating each flaky test (and how to do it).


What aspect of the CI tests are non-deterministic?

Would it be worthwhile to make all tests deterministic?

There are a number of non-deterministic aspects. Some are intentional (e.g. math tests using random numbers to increase code coverage over time), some are more inherent/annoying (e.g. file state, internet connection issues). The second set is definitely worth removing, but the first should possibly stay.


One could imagine two sets of tests, where the fully deterministic set is never allowed to fail and is enforced strictly, e.g. its CI has to pass before a merge is allowed.


There are two ways to divide those test sets:

  • split into two phases of execution
  • split into two separate CI tasks

If divided into two execution phases: this is simpler to implement, perhaps by just modifying test/runtests.jl.
The deterministic tests are executed first, and if any of them fails, the whole run fails. Once the deterministic tests pass, the other tests are executed.
If the CI platform does not support reporting multiple states, you may need to check the results of the second phase manually.
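The two-phase idea could be sketched roughly as follows. This is only an illustration, not how test/runtests.jl is actually organised: the function names `failed_tests` and `run_two_phases`, the `name => function` layout, and the status symbols are all made up for the example. A test counts as failed if its function throws, which is what a bare `@test` outside any `@testset` does on failure.

```julia
# Hypothetical helper: run each `name => function` pair and collect the
# names of the tests that throw.
function failed_tests(tests)
    failed = String[]
    for (name, test) in tests
        try
            test()
        catch
            @warn "test failed: $name"
            push!(failed, name)
        end
    end
    return failed
end

# Hypothetical two-phase runner: phase 1 gates the whole run,
# phase 2 is only reported.
function run_two_phases(deterministic, nondeterministic)
    broken = failed_tests(deterministic)
    # Any deterministic failure fails the run; phase 2 is skipped.
    if !isempty(broken)
        return (status = :failed, broken = broken, flaky = String[])
    end
    # Non-deterministic failures are recorded but do not fail the run.
    flaky = failed_tests(nondeterministic)
    status = isempty(flaky) ? :passed : :passed_with_flaky_failures
    return (status = status, broken = String[], flaky = flaky)
end
```

A CI script could then exit non-zero only when `status == :failed` and print the `flaky` list for manual inspection.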

If you choose to split into separate CI tasks: the number of test runs doubles, but you get a clearer view of the status of each type of test.

Not sure what you mean?

What is Rust’s stance on non-deterministic tests? A very superficial search seems to indicate they don’t have them, but I couldn’t find any reliable source.

(e.g. math tests using random numbers to increase code coverage over time)

How does using an RNG which can lead to random failures improve test coverage if the first solution to failures is to run the test suite again and hope the RNG plays nicely this time around?


Theoretically, at least, we only rerun tests after looking at what failed and whether it’s potentially real.

Hopefully in this case the preferred solution is to use the failure as a bug report and try to fix it. This class of failure is generally easy to reproduce by setting the RNG seed.
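As a rough sketch of what "reproduce by setting the RNG seed" can look like: the helper below picks a seed, runs a randomized check with an RNG built from it, and logs the seed on failure so the same inputs can be replayed. `run_seeded` and `check_sort_idempotent` are hypothetical names for this example, and I am assuming the check takes the RNG as an explicit argument, which the real tests may not do.

```julia
using Random

# Hypothetical randomized check: all randomness comes from `rng`.
function check_sort_idempotent(rng::AbstractRNG)
    v = rand(rng, 1:100, 20)
    return sort(sort(v)) == sort(v)
end

# Hypothetical helper: run `check` with an RNG built from `seed` and
# report the seed if the check returns false.
function run_seeded(check; seed = rand(UInt32))
    rng = MersenneTwister(seed)
    ok = check(rng)
    ok || @error "randomized test failed; rerun with this seed to reproduce" seed
    return (ok = ok, seed = seed)
end
```

If a CI log shows a failure with some seed, calling `run_seeded(check_sort_idempotent; seed = ...)` with that value locally draws the same inputs on the same Julia version, so the failure can be debugged instead of rerun away.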

Is there a list of tests which are RNG-unstable?

The RNG tests aren’t the ones that cause failures; by far the most common failures are in the network tests.


Would it help if all network tests had their own CI job, so one could immediately see that failures are unrelated and rerun only that job manually?


This discussion is great and I am learning a lot from it. Some of the questions and suggestions are quite pertinent/insightful. I would urge the folks spearheading this questioning to document what is being learnt and even to make pull requests with their suggestions (even if the pull request is a draft and incomplete) in order to keep the ball rolling; otherwise we will just have a long thread that becomes outdated. The core devs have not done this, not because they disagree that it is valuable, but because their TODO lists of valuable work are 10x longer than what they have time to do.
