
canonical-ci-engineering team mailing list archive

Re: Unstable upstream merger makos and maguros

 

>>>>> Francis Ginther <francis.ginther@xxxxxxxxxxxxx> writes:

    > We once again have a number of wedged devices:
    >  ps-maguro-01
    >  ps-maguro-02
    >  ps-maguro-03
    >  ps-mako-01
    >  ps-mako-03

    > This has gone beyond the point where it's just an inconvenience. We
    > have had to disable the maguro testing 3 days in a row, due to loss of
    > devices.

    > We have tried a different host and different hubs, neither made a
    > difference. That leaves either the devices themselves, a problem in
    > the flash tools or the image itself.

Since all bets are on, I'd go with an issue in the flash tools with a
preference for adb ;)

It may be a bug there or a misuse.

    > My current (sleep inspired) thoughts on a course of action:

    >  - Can we engage the foundations team to take a look? Perhaps they can
    > spot something in our infrastructure or actually find an issue. I know
    > Sergio has mentioned disabling MTP on the host, but I don't know the
    > details.

+1

    >  - Use an alternate flash? Saviq mentioned using "system-image-cli
    > --build 0 --verbose" to do a light flash. Does anyone know if this is
    > viable?

Looks like this is discussed already in the other thread ;)

    >  - Throw more devices at the system?

Yeah, brute force can work to a certain extent and should not be ruled
out.

If I understand correctly, the number of failures has risen *because*
we started flashing more often, but the failures to flash are not
systematic.

So while I dislike the approach, that may still be the most pragmatic
one.
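
To make that reasoning a bit more concrete, here is a rough sketch. It
assumes each flash wedges a device independently with a fixed
probability, which is only a guess at what "not systematic" means. The
function names and the numbers in the usage lines are made up for
illustration, not measured.

```python
from math import comb


def p_device_survives(flashes: int, p_wedge: float) -> float:
    """Chance one device is still usable after `flashes` flashes,
    assuming each flash wedges it independently with probability p_wedge."""
    return (1.0 - p_wedge) ** flashes


def p_enough_devices(pool: int, needed: int, flashes: int, p_wedge: float) -> float:
    """Chance that at least `needed` devices out of `pool` are still usable
    after every device has been flashed `flashes` times (binomial tail)."""
    s = p_device_survives(flashes, p_wedge)
    return sum(
        comb(pool, k) * s**k * (1.0 - s) ** (pool - k)
        for k in range(needed, pool + 1)
    )


if __name__ == "__main__":
    # Hypothetical figures: 5% wedge chance per flash, 10 flashes per device.
    for pool in (3, 5, 8):
        print(pool, round(p_enough_devices(pool, 2, 10, 0.05), 3))
```

Under those assumptions, flashing more often lowers the chance of
keeping enough devices alive, and a bigger pool raises it. It says
nothing about why the devices wedge in the first place.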

    > Do we have any spares, especially maguros? I don't want to suggest
    > swapping devices with those actively being used for fear of
    > de-stabilizing those tests.

    >  - Drop the touch device testing? I don't really want to do this, but
    > if we can't get this stabilized, dealing with the issues may
    > eventually outweigh the benefits.

I strongly oppose disabling tests because we can't make them pass[1].

Apart from that flashing issue, we are already unable to run desktop
tests on radeon (see
https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-10-24-autopilot-crashes-gnome-session-on-radeon-7750
and the associated
https://bugs.launchpad.net/ubuntu/+source/glamor-egl/+bug/1244324). The
associated fix (which already took ~1 week) has not landed upstream so
the tests against radeon will probably stay disabled for a week or two.

Then there is
https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-11-04-otto-outage
where we reverted the kernel to a previous known-working version. That
means we're potentially hiding failures, and I don't even know if a bug
has been filed for that.

If the same logic is applied each time we encounter an issue, we'll end
up not running tests at all :)

But even without going that far, reducing the coverage is a Bad Thing.

New bugs are piling up in the uncovered areas, I have no doubt about
that ;)

    > What other approaches have I missed?

Running away?

Kidding, I agree with your summary.

         Vincent

