CI on master sometimes fails, which makes it harder for new contributors to make PRs because they don't get a clear signal about whether a failure was caused by their changes. Some projects, like Rust, ensure that the master branch of rust-lang/rust is always in a valid state.
Are the devs opposed to such a system, has nobody got around to setting up the automation, or is there a budget problem?
The problem with this is that our test suite is slightly non-deterministic. Occasionally someone merges a PR that was green by chance but actually introduces a bug that only shows up in, say, 10% of CI runs. That then blocks a lot of other PRs from merging while we identify the PR that broke things and fix it.
I think a good philosophy is described in this issue from a (non-Rust, but very good) async library, https://github.com/python-trio/trio/issues/200, namely applying heavy effort to track down and eliminate each flaky test (the issue also describes how to do it).
There are a number of non-deterministic aspects. Some are intentional (e.g. math tests using random numbers to increase code coverage over time), and some are more inherent/annoying (e.g. file state, internet connection issues). The second set is definitely good to remove, but the first possibly should stay.
One could imagine two sets of tests: a fully deterministic set that is never allowed to fail, with stringent requirements such as its CI having to pass before a merge is allowed, and the remaining non-deterministic set.
If the tests are divided into two execution phases within the same run, this is simpler to implement; perhaps just modify test/runtests.jl. The deterministic tests are executed first, and if any of them fail, the whole run fails. Once the deterministic tests pass, the remaining tests are executed (a rough sketch follows below).
If the CI platform does not support reporting more than one status, you may need to check the results of the second phase manually.
If you instead choose to split the tests into separate CI runs, the number of runs doubles, but you get a clearer view of the status of each kind of test.
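A minimal sketch of what such a two-phase test/runtests.jl could look like. The file lists below are hypothetical placeholders, not the real test suite layout; the point is only the ordering and the failure behaviour.

```julia
using Test

# Hypothetical file lists; placeholders, not the real Julia test layout.
deterministic_tests    = ["core.jl", "arrays.jl"]         # must always pass
nondeterministic_tests = ["rand_math.jl", "download.jl"]  # allowed to be flaky

# Phase 1: deterministic tests. A top-level @testset throws once it finishes
# if anything failed, so any failure here aborts the script before phase 2.
@testset "deterministic" begin
    for file in deterministic_tests
        include(file)
    end
end

# Phase 2: only reached if phase 1 passed. Failures here still make the
# process exit non-zero, so a CI platform that only reports a single status
# would need a separate job (or manual inspection) to tell the phases apart.
@testset "non-deterministic" begin
    for file in nondeterministic_tests
        include(file)
    end
end
```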
What is Rust's stance on non-deterministic tests? A very superficial search seems to indicate that they don't have them, but I couldn't find a reliable source.
(e.g. math tests using random numbers to increase code coverage over time)
How does using an RNG that can lead to random failures improve test coverage, if the first response to a failure is to run the test suite again and hope the RNG plays nicely this time around?
Hopefully in this case the preferred solution is to use the failure as a bug report and try to fix it. This class of failure is generally easy to reproduce by setting the RNG seed.
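For example, a randomized test can log the seed it used so that any failure can be replayed exactly. A minimal sketch, where the matrix test and tolerance are made up for illustration:

```julia
using Test, Random, LinearAlgebra

# Log the seed so a failing run can be reproduced by seeding with the same
# value (e.g. passed back in via an environment variable).
seed = rand(UInt32)
@info "randomized test seed" seed
Random.seed!(seed)

@testset "random matrix inverse" begin
    A = rand(10, 10) + 10I   # well-conditioned random matrix
    @test A * inv(A) ≈ Matrix{Float64}(I, 10, 10) atol = 1e-8
end
```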
Would it help if all network tests had their own CI job, so one could immediately see that failures are unrelated, and could manually re-run just that job?
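One way to support that on the test-suite side is to gate the network-dependent tests behind an environment variable that only the dedicated CI job sets. A minimal sketch, where the JULIA_TEST_GROUP variable and the file names are hypothetical:

```julia
using Test

# Hypothetical grouping variable set by each CI job; file names are placeholders.
group = get(ENV, "JULIA_TEST_GROUP", "offline")

if group == "network"
    # Dedicated job: run only the tests that need internet access.
    @testset "network" begin
        include("download.jl")
    end
else
    # Default job: everything that can run without a network connection.
    @testset "offline" begin
        include("core.jl")
        include("arrays.jl")
    end
end
```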
This discussion is great and I am learning a lot from it. Some of the questions and suggestions are quite pertinent/insightful. I would urge the folks spearheading this questioning to document what is being learnt and even to make pull requests with their suggestions (even if the pull requests are draft and incomplete) in order to keep the ball rolling; otherwise we will just have a long thread that becomes outdated. The core devs have not done this, not because they disagree that it is valuable, but because their TODO lists of valuable work are 10x longer than what they have time to do.