Noisy integration tests (appveyor, travis)

#1

I’ve submitted a few pull requests to Julia in the last year or so. I found the CI a bit noisy, and I wonder if more people have run into the same. Specifically, I’ve seen many more false positives (i.e. failures unrelated to my pull request) than true positives (i.e. issues with my PR).

Most recently, AppVeyor's build for https://github.com/JuliaLang/julia/pull/31792 failed with a download issue; I think I've seen something similar before.

To be fair, CI has saved my behind a few times as well, particularly on platforms where Int is Int32. But anecdotally, the false positive ratio seems off.

Does that experience mirror other people’s experiences? Would it make sense to capture some statistics on this?
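If such statistics were captured, the computation itself would be trivial; here is a minimal sketch, assuming each failure has been hand-labeled as PR-related or infrastructure-related (the label strings are made up for illustration):

```python
def false_positive_ratio(failure_labels):
    """Fraction of CI failures unrelated to the PR under test.

    Each entry is a hand-applied label: "pr-bug" for a true positive
    (the PR really was broken) or "infra" for a false positive
    (flaky test, download error, service outage, ...).
    """
    if not failure_labels:
        return 0.0
    return failure_labels.count("infra") / len(failure_labels)
```

The hard part, of course, is collecting and labeling the failures in the first place, not crunching the numbers.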

#2

I think that fixing CI quirks is the best solution. This is occasionally tedious though.

The Julia team is mostly aware of these issues and they are working on fixing spurious CI errors, but if you can help, I am sure it would be appreciated.

#3

The people who work on Julia day-to-day are all acutely aware of CI flakiness—if you think it’s annoying, imagine dealing with it every day. The biggest problem is that 3rd party CI services like Travis and AppVeyor are really unstable. Not in the sense that they go down, but in the sense that they change all sorts of aspects of the CI environment frequently and without notice, including but not limited to:

  • what kind of VM your CI runs on
  • what OS version your CI runs
  • what versions of software like gcc, clang, and gfortran are provided
  • whether your CI has network access to external servers like github.com or not

This stuff changes all the time and often breaks our CI. Worse, we have no control over when this happens and no recourse to roll back any such changes. So it just happens without warning and then we’re left scrambling to try to fix things. And keep in mind that we are paying for these services, not using the free tier.

This is why @staticfloat has put in a lot of work to set up our own buildbot and test servers—so that we can control them. Yes, we have to maintain those servers and keep them running, but at least they won’t change VM/OS/compiler/network config randomly and without notice or recourse. The transition to these new buildbots has been a bit rough, but they’re pretty stable now, and they should be much more reliable than AppVeyor and Travis have been.

#4

I think that fixing CI quirks is the best solution. This is occasionally tedious though.

The Julia team is mostly aware of these issues and they are working on fixing spurious CI errors, but if you can help, I am sure it would be appreciated.

@Tamas_Papp my apologies for setting you up for “developer complains about free software” / “feel free to send pull requests”. I had a more constructive intention with this topic!

However, fixing CI this way feels like whack-a-mole: “the tests are a given; let’s throw developer time at the issues as they appear”. There’s actually an opportunity for automation here. For example, the first thing I do when I see a test failure is check whether master builds correctly. That’s something that can be automated.
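The “does master also fail?” check could be scripted against any CI provider’s status API. A minimal triage sketch, where the "passed"/"failed" strings stand in for statuses that would really come from a provider API (hypothetical here, not any specific service’s response format):

```python
def classify_pr_failure(pr_result, master_result):
    """Triage a failed PR build by comparing it with the most recent
    build of master on the same CI configuration.

    Both arguments are "passed" or "failed". If master fails too, the
    PR failure is probably an infrastructure problem, not the PR.
    """
    if pr_result == "passed":
        return "ok"                      # nothing to triage
    if master_result == "failed":
        return "likely-infrastructure"   # master is broken as well
    return "needs-investigation"         # only the PR fails
```

A bot running this on every red PR build could at least annotate failures as “master is also red”, saving contributors the manual check.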

More interestingly, here’s something that Microsoft wrote about:

[…] we’ve been instituting a formal test reliability process. […] A reliability run picks the latest successful CI build, runs all the tests and looks at the results. Any test that fails is considered flakey (because it previously passed on the same build). The test is disabled and a bug is filed.

Source: https://devblogs.microsoft.com/bharry/testing-in-a-cloud-delivery-cadence/
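The process described in the quote could be sketched roughly as follows. This is my reading of it, not Microsoft’s actual implementation; the function name and the shape of the result records are assumptions:

```python
def reliability_run(results_on_green_build):
    """Given results from re-running the full test suite against the
    latest *successful* CI build, return the tests to disable as flaky.

    Every test passed on this exact build before, so any failure in the
    re-run must be nondeterministic rather than a regression.
    """
    flaky = [name for name, passed in results_on_green_build.items()
             if not passed]
    for name in flaky:
        # In a real system: disable the test and file a tracking bug.
        pass
    return flaky
```

The key idea is that the build is held fixed, so the re-run isolates test nondeterminism from code changes.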

It’s with options like this in mind that I started this topic. What do you think?

@StefanKarpinski: Thanks for the overview! Is it on the roadmap, then, to disable Travis / AppVeyor completely as soon as we have sufficient confidence in @staticfloat’s work?

#5

Yes. They’ve proven too flaky over the years to be useful. If a CI service isn’t all green almost all of the time, it’s not actually helping. For the amount we pay for CI services, we can easily afford our own compute; the main issue is managing it. But since running CI for the Julia ecosystem is part of the JuliaTeam product roadmap, we’re committed to doing that anyway.

#6

Can’t the first three of these be solved by using a container (eg Docker) image? Obviously that’s easiest for Linux.

That is still a workaround and requires scripting outside the CI process (if I understand correctly). I think the right solution is bringing more parts of the process under control, so I think the buildbot and test servers may be the best solution.

FWIW, I am finding GitHub + Travis a bit fragile even for applications that are far less complex than Julia, but I have had a better experience with GitLab lately. It forces one to use a container from the start, which looks like a hassle but in the long run isn’t. I also expect that if push comes to shove, I could self-host it and transition seamlessly.

#7

It depends on the kind of problem. Containers would help with the OS or software versions changing as long as the performance characteristics of the underlying hardware/VM/OS don’t change too much. So that would address problems like CI’s copy of libstdc++ or gfortran suddenly changing to something we don’t support, which has happened (and is a big pain) but isn’t that common. So containers would help a bit for Linux CI. On the other hand, I don’t know of any widely used container solutions that run Windows or macOS inside the container. Yes, you can run Linux containers on those OSes (with significant overhead), but that defeats the purpose of running CI jobs on those operating systems, which is to test that Julia works correctly inside of those environments. Maybe there’s something better to be done there. If so, suggestions are welcomed.

We’ve also had many issues that containers would not have helped with. Services change VMs and our CI jobs suddenly don’t have enough memory, or don’t have enough real cores to finish before the time limit, or the time limit is simply reduced without warning. For a long time (possibly still) we were running on free-tier VMs shared with other open source projects even though we are a paying customer, because CI services aren’t designed to run paid CI on public GitHub repos. In other cases it has been because we get an open source discount: we pay, but a bit less, and because of that we get shitty free, shared VMs.

There was one case where some kernel configuration on the underlying machine was changed so that it could supposedly run more concurrent CI jobs, but the result for Julia’s CPU-intensive test suite was that we started timing out every time and almost never finishing CI. This caused CI to get restarted a lot, which, of course, uses more compute overall, not less. Jameson carefully diagnosed the issue—I’m not sure how, it was impressive sleuthing—and we reported it to their support staff, but they couldn’t/wouldn’t change the kernel setting. I don’t recall how this got resolved. Probably with us paying them more money. Yet, as this thread shows, paying more money does not seem to solve these problems for long.