canonical-ci-engineering team mailing list archive

Re: Unstable upstream merger makos and maguros

To: Francis Ginther <francis.ginther@xxxxxxxxxxxxx>
From: Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx>
Date: Wed, 06 Nov 2013 09:59:44 +0100
Cc: Michał Sawicz <michal.sawicz@xxxxxxxxxxxxx>, Thomi Richards <thomi.richards@xxxxxxxxxxxxx>, canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAB2r3jLx=waFFKqG8ZrvYJ6JZ_eiWrqSFxjKBAxLUcudE4S-EA@mail.gmail.com> (Francis Ginther's message of "Tue, 5 Nov 2013 23:20:47 -0600")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)

>>>>> Francis Ginther <francis.ginther@xxxxxxxxxxxxx> writes:

    > We once again have a number of wedged devices:
    >  ps-maguro-01
    >  ps-maguro-02
    >  ps-maguro-03
    >  ps-mako-01
    >  ps-mako-03

    > This has gone beyond the point where it's just an inconvenience. We
    > have had to disable the maguro testing 3 days in a row, due to loss of
    > devices.

    > We have tried a different host and different hubs, neither made a
    > difference. That leaves either the devices themselves, a problem in
    > the flash tools or the image itself.

Since all bets are on, I'd go with an issue in the flash tools with a
preference for adb ;)

It may be a bug there or a misuse.

    > My current (sleep inspired) thoughts on a course of action:

    >  - Can we engage the foundations team to take a look? Perhaps they can
    > spot something in our infrastructure or actually find an issue. I know
    > Sergio has mentioned disabling MTP on the host, but I don't know the
    > details.

+1

    >  - Use an alternate flash? Saviq mentioned using "system-image-cli
    > --build 0 --verbose" to do a light flash. Does anyone know if this is
    > viable?

Looks like this is discussed already in the other thread ;)

    >  - Throw more devices at the system?

Yeah, brute force can work to a certain extent and should not be ruled
out.

If I understand correctly, the number of failures has risen *because*
we started flashing more often, but the failures to flash are not
systematic.

So while I dislike the approach, that may still be the most pragmatic
one.

    > Do we have any spare, especially maguros? I don't want to suggest
    > swapping devices with those actively being used for fear of
    > de-stabilizing those tests.

    >  - Drop the touch device testing? I don't really want to do this, but
    > if we can't get this stabilized, dealing with the issues may
    > eventually outweigh the benefits.

I strongly oppose disabling tests because we can't make them pass[1].

Apart from that flashing issue, we are already unable to run tests on
desktop for radeon (see
https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-10-24-autopilot-crashes-gnome-session-on-radeon-7750
and the associated
https://bugs.launchpad.net/ubuntu/+source/glamor-egl/+bug/1244324). The
associated fix (which already took ~1 week) has not landed upstream so
the tests against radeon will probably stay disabled for a week or two.

Then there is
https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-11-04-otto-outage
where we reverted the kernel to a previous known-working version, which
means we're potentially hiding failures, and I don't even know if a bug
has been filed for that.

If the same logic is applied each time we encounter an issue, we'll end
up not running tests at all :)

But even without going that far, reducing the coverage is a Bad Thing.

New bugs are piling up in the uncovered areas, I have no doubt about
that ;)

    > What other approaches have I missed?

Running away?

Kidding, I agree with your summary.

         Vincent

References

Unstable upstream merger makos and maguros
From: Francis Ginther, 2013-11-06