
canonical-ci-engineering team mailing list archive

Re: Invitation: Sync with QA about automation tasks on the backlog @ Thu Nov 6, 2014 4pm - 4:30pm (Thomas Strehl)

 

Hi Francis,

Thanks heaps for looking at this! From memory, I've also seen problems with cloud-worker-10 when building for unity-scopes-api-devel-ci.

If the nodes stop periodically, that would be perfectly consistent with the kinds of failures we are seeing, so it seems likely that you are on the right track.

Basically, for our tests to succeed, we need a machine that (roughly) provides the same performance as a phone. If things are four or five times slower than a phone, that's not a problem. But it's essentially impossible for us to be resilient when things are 50 times slower or more: in some cases, that would slow the tests down intolerably and, in other cases, we would no longer be testing with any relevance to the real-world execution environment.

Anyway, thanks again for looking into this! We have seen similar issues in the past, but never to this degree. Is there a way to install some sort of watchdog process that can alert you to this problem? From our end, when a test fails on Jenkins, it is *very* difficult to establish that something on Jenkins is at fault; we tend to blame ourselves first. If the actual cause is something on Jenkins, it means that we have spent many hours hunting for a fault in our code, only to find out that we were chasing ghosts.

So perhaps some sort of benchmarking process that runs periodically and verifies that a test machine delivers the expected performance would help?
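
To make that concrete, here is a rough sketch of the kind of thing I have in mind (Python; the interval, threshold, and reporting are made up, and I don't know what monitoring hooks you already have). It simply sleeps in a loop and complains whenever the wake-up is delayed far beyond the requested interval, which is exactly the symptom we are seeing:

    #!/usr/bin/env python3
    # Sketch of a scheduling-latency watchdog for a test node.
    # Repeatedly sleeps for a short interval and reports whenever the
    # wake-up is delayed by more than a threshold, i.e. runnable work
    # was not run for an extended period.
    import socket
    import sys
    import time

    INTERVAL = 0.1       # seconds we ask to sleep each iteration
    THRESHOLD = 1.0      # extra seconds of delay that counts as a stall
    REPORT_EVERY = 60    # seconds between summary lines

    def main():
        host = socket.gethostname()
        worst = 0.0
        next_report = time.monotonic() + REPORT_EVERY
        while True:
            start = time.monotonic()
            time.sleep(INTERVAL)
            overrun = (time.monotonic() - start) - INTERVAL
            worst = max(worst, overrun)
            if overrun > THRESHOLD:
                # A real deployment would alert or take the node offline
                # rather than just printing.
                print("%s: sleep(%.1f s) overran by %.1f s"
                      % (host, INTERVAL, overrun),
                      file=sys.stderr, flush=True)
            if time.monotonic() >= next_report:
                print("%s: worst overrun in last %d s: %.3f s"
                      % (host, REPORT_EVERY, worst), flush=True)
                worst = 0.0
                next_report = time.monotonic() + REPORT_EVERY

    if __name__ == "__main__":
        main()

If something like that reports multi-second overruns on an otherwise idle worker, the node clearly isn't fit for running timing-sensitive tests.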

Cheers,

Michi.


On 6 Nov 2014, at 14:46, Francis Ginther <francis.ginther@xxxxxxxxxxxxx> wrote:

> Michi,
> 
> There is something bad happening on two of the cloud-worker nodes, 03 and 09, and I've disabled them until we can take a closer look. These two nodes were used for all of the failures that I found when looking back through the prior test runs, except for a few that occurred before Oct 24.
> 
> I've poked around enough to see that user-space processes are occasionally being held off for periods of 5-10 seconds, which matches the description of the problem Michi provided. A closer look will have to wait until tomorrow.
> 
> Francis
> 
> On Wed, Nov 5, 2014 at 2:18 PM, Thomi Richards <thomi.richards@xxxxxxxxxxxxx> wrote:
> Hi everyone,
> 
> 
> This is a query for the CI team. I've CCed their mailing list in the reply, but for anything infrastructure-related, you should be talking to someone on #ci (Canonical IRC server) or #ubuntu-ci-eng (freenode IRC server).
> 
> CI team: the unity APIs team has been seeing test failures on the i386 test runners since 24 October. Could someone please advise them on what they can do to get this resolved? (see below)
> 
> 
> Cheers!
> 
> On Wed, Nov 5, 2014 at 9:14 PM, Michi Henning <michi.henning@xxxxxxxxxxxxx> wrote:
> >
> > I am not aware of anything. You are talking about unit tests, right?
> > Can you please link to one such failure?
> 
> Hi Leo,
> 
> Basically, the story is that we have unit tests failing on Jenkins. The problems started on 24 October (as best I can tell), and they *only* strike on the i386 builds; amd64 and ARM always succeed.
> 
> The tests that fail include tests that have *never* (yes, literally never) failed before, not on any of our desktops, not on the phone, not on anything, including Jenkins. Suddenly, they are failing by the bucket load (yes, I really mean bucket load). There is a pattern in the failures: every single failure relates to tests that, basically, do something, wait for a while, and then check that whatever is supposed to happen has actually happened.
> 
> The tests are very tolerant in terms of timing thresholds, so it's not as if we are waiting for something that normally takes 1 ms and then failing if it hasn't happened after 5 ms. The failures we are talking about are all in the 500 ms and greater range. For example, we have seen a failure in a test where we exec a simple process that, once it is started, returns a message. Normally (even on the phone), that takes about 120 ms. In that test we wait up to 4 seconds and fail only if the message doesn't reach us within that time.
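> 
> Just to illustrate the shape of such a test (our real tests are C++; this is only a Python sketch, and "scope-helper" is a made-up name for the child process):
> 
>     # Spawn a helper process and expect a startup message on stdout.
>     # The helper normally answers within ~120 ms; the test allows 4 s.
>     import select
>     import subprocess
> 
>     def test_helper_reports_ready():
>         proc = subprocess.Popen(["scope-helper"], stdout=subprocess.PIPE)
>         try:
>             # Wait up to 4 seconds (about 30x the normal latency) for
>             # the pipe to become readable, then check the message.
>             readable, _, _ = select.select([proc.stdout], [], [], 4.0)
>             assert readable, "no startup message within 4 s"
>             assert b"ready" in proc.stdout.readline()
>         finally:
>             proc.kill()
>             proc.wait()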
> 
> We also see failures in a test that runs two threads: one does something periodically, and the other waits for that worker thread to complete certain tasks. This test has never failed anywhere, and it has succeeded unchanged on i386 for at least the last six months. Since 24 October, we are seeing it fail regularly. It is absolutely certain that the problem is not with the test itself (in the sense of a race condition or some such): the test runs cleanly under valgrind, helgrind, thread sanitizer, address sanitizer, etc. We wait half a second for something that takes a microsecond to do, and there are no other threads busy in the test, yet the one runnable thread that does the job does not run for half a second.
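> 
> Again purely as an illustration of the shape (Python rather than our actual code, names made up), the waiting side boils down to this:
> 
>     # Two threads: a worker signals completion, the main thread waits
>     # up to 500 ms for work that takes on the order of a microsecond.
>     import threading
> 
>     def test_worker_signals_completion():
>         done = threading.Event()
> 
>         def worker():
>             # ... trivial amount of work ...
>             done.set()
> 
>         t = threading.Thread(target=worker)
>         t.start()
>         # Half a second is a huge margin for a microsecond of work, yet
>         # on the affected nodes this wait sometimes expires.
>         assert done.wait(timeout=0.5), "worker did not signal within 500 ms"
>         t.join()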
> 
> There are dozens of tests that are (more or less randomly) affected. Sometimes this blows up, sometimes that… The failure pattern we are seeing is consistent with either a heavily (as in very heavily) loaded machine, or some problem with thread scheduling, where threads that are runnable get delayed on the order of a second or more.
> 
> In summary, everything I'm seeing points to some issue on Jenkins i386, because the failures don't happen anywhere else, and happen for tests that (unchanged) have succeeded on Jenkins hundreds of times prior to 24 October.
> 
> Is there a way to figure out what is going on in the Jenkins infrastructure? For example, if the Jenkins build tells me that it is running on cloud-worker-10, is there a way for me to find out what physical machine that corresponds to, and what the load on that machine is at the time? I strongly suspect that the problems are due either to the build machine trying to do more than it can, or possibly to I/O virtualization. (That second guess may well be wrong, seeing that all our comms run over the backplane via Unix domain sockets.)
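> 
> As a stopgap from our side, we could presumably have each test run record the worker's host name and load average when it starts, along these lines (this only shows the load inside the instance, of course, not on the physical host underneath):
> 
>     # Sketch: log where a test run happens and how busy the worker is.
>     import os
>     import socket
>     import time
> 
>     def log_test_environment():
>         load1, load5, load15 = os.getloadavg()
>         print("run on %s at %s, loadavg %.2f %.2f %.2f"
>               % (socket.gethostname(),
>                  time.strftime("%Y-%m-%d %H:%M:%S"),
>                  load1, load5, load15),
>               flush=True)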
> 
> If you want to see some of the failures, a look through the recent build history for unity-scopes-api-devel-ci and unity-scopes-api-devel-autolanding shows plenty of failed test runs. The failures will probably not mean much to you without knowing our code. But, the upshot is that, for every single one of them, the failure is caused by something taking orders of magnitude (as in 100-1000 times) longer than what is reasonable.
> 
> Thanks,
> 
> Michi.
> 
> 
> 
> -- 
> Thomi Richards
> thomi.richards@xxxxxxxxxxxxx
> 
> --
> Mailing list: https://launchpad.net/~canonical-ci-engineering
> Post to     : canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~canonical-ci-engineering
> More help   : https://help.launchpad.net/ListHelp
> 
> 
> 
> 
> -- 
> Francis Ginther
> Canonical - Ubuntu Engineering - Continuous Integration Team

