Re: Invitation: Sync with QA about automation tasks on the backlog @ Thu Nov 6, 2014 4pm - 4:30pm (Thomas Strehl)
Michi,
There is something bad happening on two of the cloud-worker nodes, 03
and 09, and I've disabled them until we can take a closer look. These two
nodes were used for all of the failures that I found when looking back
through the prior test runs, except for a few that occurred before Oct 24.
I've poked around enough to see that user-space processes are occasionally
being held off for periods of 5-10 seconds, which matches the description
of the problem Michi provided. A closer look will have to wait until
tomorrow.
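For anyone who wants to double-check a node, a stall like this should be
visible with even a trivial detector. A minimal sketch (illustrative only,
not our actual tooling, and assuming Python 3 is available on the worker):
request short sleeps and report whenever the wall clock overshoots badly,
which is what a held-off userspace process looks like.

    #!/usr/bin/env python3
    # Illustrative stall detector (a sketch, not actual CI tooling):
    # request a short sleep and report whenever the elapsed time
    # overshoots badly -- the signature of a runnable userspace
    # process being held off by the scheduler or hypervisor.
    import time

    INTERVAL = 0.05   # requested sleep, in seconds
    THRESHOLD = 1.0   # report overshoots larger than this

    while True:
        start = time.monotonic()
        time.sleep(INTERVAL)
        overshoot = (time.monotonic() - start) - INTERVAL
        if overshoot > THRESHOLD:
            print("%s: held off for %.2fs"
                  % (time.strftime("%H:%M:%S"), overshoot))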
Francis
On Wed, Nov 5, 2014 at 2:18 PM, Thomi Richards
<thomi.richards@xxxxxxxxxxxxx> wrote:
> Hi everyone,
>
>
> This is a query for the CI team. I've CCed their mailing list in the
> reply, but for anything infrastructure-related, you should be talking to
> someone on #ci (canonical IRC server) or #ubuntu-ci-eng (freenode IRC
> server).
>
> CI team: the unity APIs team have been seeing test failures on the i386
> test runners since 24 October. Could someone please advise them on what
> they can do to get this resolved? (see below)
>
>
> Cheers!
>
> On Wed, Nov 5, 2014 at 9:14 PM, Michi Henning
> <michi.henning@xxxxxxxxxxxxx> wrote:
>
>> >
>> > I am not aware of anything. You are talking about unit tests, right?
>> > Can you please link to one such failure?
>>
>> Hi Leo,
>>
>> basically, the story is that we have unit tests failing on Jenkins. The
>> problems started on 24 October (as best I can tell), and they *only* strike
>> for the i386 builds; amd64 and ARM always succeed.
>>
>> The tests that fail include tests that have *never* (yes, literally
>> never) failed before, not on any of our desktops, not on the phone, not on
>> anything, including Jenkins. Suddenly, they are failing by the bucketload
>> (yes, I really mean bucketload). There is a pattern in the failures: every
>> single failure relates to tests that, basically, do something, wait for a
>> while, and then check that whatever is supposed to happen has actually
>> happened.
>>
>> The tests are very tolerant in terms of the timing thresholds, so it's
>> not as if we wait for something that normally takes 1 ms and then
>> fail if it hasn't happened after 5 ms. The failures we are talking about
>> are all in the 500 ms and greater range. For example, we have seen a test
>> failure where we exec a simple process that, once it is started, returns a
>> message. Normally (even on the phone), that takes about 120 ms. We wait for
>> 4 seconds for that test and fail if the message doesn't reach us within
>> that time.
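>>
>> Schematically, the pattern is something like the sketch below (the names
>> are made up for illustration; this is not our actual test code):
>>
>>     import subprocess
>>     import time
>>
>>     TIMEOUT = 4.0  # the helper normally answers in ~120 ms
>>
>>     start = time.monotonic()
>>     # "./helper" is a hypothetical stand-in for the process we exec.
>>     proc = subprocess.Popen(["./helper"], stdout=subprocess.PIPE)
>>     try:
>>         out, _ = proc.communicate(timeout=TIMEOUT)
>>     except subprocess.TimeoutExpired:
>>         proc.kill()
>>         raise AssertionError("no message within %.1f s" % TIMEOUT)
>>     print("message arrived after %.3f s" % (time.monotonic() - start))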
>>
>> We also see failures in a test that runs two threads, one of which does
>> something periodically, and the other one waits for the worker thread to
>> complete certain tasks. This test has never failed anywhere, and has
>> succeeded unchanged for i386 for at least the last six months. Since 24
>> October, we have been seeing it fail regularly. It is absolutely certain
>> that the problem is not with the test itself (a race condition or some
>> such). The test runs cleanly with valgrind, helgrind,
>> thread sanitizer, address sanitizer, etc., and we wait for half a second
>> for something to happen that takes a microsecond to do, and there are no
>> other threads busy in the test. Yet, the one runnable thread that does the
>> job does not run for half a second.
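>>
>> Again schematically, and again with made-up names rather than our real
>> code, the shape of that test is:
>>
>>     import threading
>>
>>     done = threading.Event()
>>
>>     def worker():
>>         # The real worker does a microsecond's worth of work here.
>>         done.set()
>>
>>     threading.Thread(target=worker).start()
>>     # A generous 500 ms wait; this can only fail if the runnable
>>     # worker thread is not scheduled for half a second.
>>     assert done.wait(timeout=0.5), "worker did not run within 500 ms"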
>>
>> There are dozens of tests that are (more or less randomly) affected.
>> Sometimes this blows up, sometimes that… The failure pattern we are seeing
>> is consistent with either a heavily (as in very heavily) loaded machine, or
>> some problem with thread scheduling, where threads that are runnable get
>> delayed on the order of a second or more.
>>
>> In summary, everything I'm seeing points to some issue on Jenkins i386,
>> because the failures don't happen anywhere else, and happen for tests that
>> (unchanged) have succeeded on Jenkins hundreds of times prior to 24 October.
>>
>> Is there a way to figure out what is going on in the Jenkins
>> infrastructure? For example, if the Jenkins build tells me that it is
>> happening on cloud-worker-10, is there a way for me to figure out what
>> physical machine that corresponds to, and what the load on that machine is
>> at the time? I strongly suspect that the problems are either due to the
>> build machine trying to do more than it can, or possibly due to I/O
>> virtualization. (That second guess may well be wrong, seeing that all our
>> comms run over the backplane via Unix domain sockets.)
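>>
>> If we can get shell access to a worker, even something as simple as
>> sampling the load average during a run would tell us a lot. A sketch,
>> assuming a Linux node (the access method is my assumption):
>>
>>     import time
>>
>>     # Print the 1/5/15-minute load averages once a second.
>>     while True:
>>         with open("/proc/loadavg") as f:
>>             print(time.strftime("%H:%M:%S"), f.read().strip())
>>         time.sleep(1)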
>>
>> If you want to see some of the failures, a look through the recent build
>> history for unity-scopes-api-devel-ci and
>> unity-scopes-api-devel-autolanding shows plenty of failed test runs. The
>> failures will probably not mean much to you without knowing our code. But
>> the upshot is that, for every single one of them, the failure is caused by
>> something taking orders of magnitude (as in 100-1000 times) longer than
>> what is reasonable.
>>
>> Thanks,
>>
>> Michi.
>
>
>
>
> --
> Thomi Richards
> thomi.richards@xxxxxxxxxxxxx
>
--
Francis Ginther
Canonical - Ubuntu Engineering - Continuous Integration Team