Re: Invitation: Sync with QA about automation tasks on the backlog @ Thu Nov 6, 2014 4pm - 4:30pm (Thomas Strehl)
Hi everyone,
This is a query for the CI team. I've CCed their mailing list in the reply,
but for anything infrastructure-related, you should be talking to someone
on #ci (Canonical IRC server) or #ubuntu-ci-eng (freenode IRC server).
CI team: the unity APIs team have been seeing test failures on the i386 test
runners since 24th October. Could someone please advise them on what they can
do to get this resolved? (see below)
Cheers!
On Wed, Nov 5, 2014 at 9:14 PM, Michi Henning <michi.henning@xxxxxxxxxxxxx>
wrote:
> >
> > I am not aware of anything. You are talking about unit tests, right?
> > Can you please link to one such failure?
>
> Hi Leo,
>
> Basically, the story is that we have unit tests failing on Jenkins. The
> problems started on 24 October (as best I can tell), and they *only* strike
> for the i386 builds. The amd64 and ARM builds always succeed.
>
> The tests that fail include tests that have *never* (yes, literally never)
> failed before, not on any of our desktops, not on the phone, not on
> anything, including Jenkins. Suddenly, they are failing by the bucketload
> (yes, I really mean bucketload). There is a pattern in the failures: every
> single failure relates to tests that, basically, do something, wait for a
> while, and then check that whatever is supposed to happen has actually
> happened.
>
> The tests are very tolerant in terms of the timing thresholds, so it's not
> as if we are waiting for something that normally takes 1 ms and then failing
> if it hasn't happened after 5 ms. The failures we are talking about are all
> in the 500 ms and greater range. For example, we have seen a test failure
> where we exec a simple process that, once it is started, returns a message.
> Normally (even on the phone), that takes about 120 ms. In that test, we wait
> up to 4 seconds and fail if the message doesn't reach us within that time.
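>
> For illustration, here is a minimal sketch of that pattern (gtest-style,
> with hypothetical names; not our actual test code):
>
>     #include <gtest/gtest.h>
>
>     #include <chrono>
>     #include <future>
>
>     TEST(Timing, GenerousDeadline)
>     {
>         // Stands in for exec'ing the helper process and waiting for its
>         // initial message; normally this completes in about 120 ms.
>         auto reply = std::async(std::launch::async, []{ return true; });
>
>         // Allow more than 30x the normal latency before declaring failure.
>         auto status = reply.wait_for(std::chrono::seconds(4));
>         EXPECT_EQ(std::future_status::ready, status);
>     }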
>
> We also see failures in a test that runs two threads, one of which does
> something periodically, and the other one waits for the worker thread to
> complete certain tasks. This test has never failed anywhere, and has
> succeeded unchanged on i386 for at least the last six months. Since 24
> October, we have been seeing it fail regularly. It is absolutely certain that
> the problem is not with the test (in the sense of a race condition or some
> such). The test runs cleanly with valgrind, helgrind,
> thread sanitizer, address sanitizer, etc., and we wait for half a second
> for something to happen that takes a microsecond to do, and there are no
> other threads busy in the test. Yet, the one runnable thread that does the
> job does not run for half a second.
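>
> The shape of that test, again as a hypothetical sketch rather than the
> real code, is roughly:
>
>     #include <cassert>
>     #include <chrono>
>     #include <condition_variable>
>     #include <mutex>
>     #include <thread>
>
>     int main()
>     {
>         std::mutex m;
>         std::condition_variable cv;
>         bool done = false;
>
>         std::thread worker([&]
>         {
>             // The actual work takes on the order of a microsecond.
>             std::lock_guard<std::mutex> lock(m);
>             done = true;
>             cv.notify_one();
>         });
>
>         // Wait half a second for something that takes a microsecond.
>         std::unique_lock<std::mutex> lock(m);
>         bool ok = cv.wait_for(lock, std::chrono::milliseconds(500),
>                               [&]{ return done; });
>         lock.unlock();
>         worker.join();
>         assert(ok);  // On the affected i386 runners, this can trip.
>         return ok ? 0 : 1;
>     }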
>
> There are dozens of tests that are (more or less randomly) affected.
> Sometimes this blows up, sometimes that… The failure pattern we are seeing
> is consistent with either a heavily (as in very heavily) loaded machine, or
> some problem with thread scheduling, where threads that are runnable get
> delayed on the order of a second or more.
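>
> One way to check the scheduling hypothesis (a sketch we could run on the
> affected runners; it assumes nothing beyond the standard library) would be
> to measure how late timed wakeups actually arrive:
>
>     #include <chrono>
>     #include <cstdio>
>     #include <thread>
>
>     int main()
>     {
>         using clock = std::chrono::steady_clock;
>         for (int i = 0; i < 10; ++i)
>         {
>             // Request a 1 ms sleep, then see how late the wakeup really is.
>             auto start = clock::now();
>             std::this_thread::sleep_for(std::chrono::milliseconds(1));
>             auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
>                 clock::now() - start).count();
>             // Overshoots in the hundreds of milliseconds would point at the
>             // scheduler or machine load rather than at our code.
>             std::printf("wakeup overshoot: %lld us\n",
>                         static_cast<long long>(elapsed - 1000));
>         }
>         return 0;
>     }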
>
> In summary, everything I'm seeing points to some issue on Jenkins i386,
> because the failures don't happen anywhere else, and happen for tests that
> (unchanged) have succeeded on Jenkins hundreds of times prior to 24 October.
>
> Is there a way to figure out what is going on in the Jenkins
> infrastructure? For example, if the Jenkins build tells me that it is
> happening on cloud-worker-10, is there a way for me to figure out what
> physical machine that corresponds to, and what the load on that machine is
> at the time? I strongly suspect that the problems are due either to the
> build machine trying to do more than it can, or possibly to I/O
> virtualization. (That second guess may well be wrong, seeing that all our
> comms run over the backplane via Unix domain sockets.)
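>
> Failing that, we could at least have the tests log the machine's load
> themselves whenever a timing assertion trips. A minimal Linux-only sketch:
>
>     #include <fstream>
>     #include <iostream>
>     #include <string>
>
>     int main()
>     {
>         // Snapshot the 1/5/15-minute load averages at the time the test
>         // runs, so failed runs can be correlated with machine load.
>         std::ifstream f("/proc/loadavg");
>         std::string line;
>         if (std::getline(f, line))
>             std::cout << "loadavg: " << line << std::endl;
>         return 0;
>     }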
>
> If you want to see some of the failures, a look through the recent build
> history for unity-scopes-api-devel-ci and
> unity-scopes-api-devel-autolanding shows plenty of failed test runs. The
> failures will probably not mean much to you without knowing our code. But
> the upshot is that, for every single one of them, the failure is caused by
> something taking orders of magnitude (as in 100-1000 times) longer than
> what is reasonable.
>
> Thanks,
>
> Michi.
--
Thomi Richards
thomi.richards@xxxxxxxxxxxxx