canonical-ci-engineering team mailing list archive
Message #00245
Re: Unstable upstream merger makos and maguros
>>>>> Francis Ginther <francis.ginther@xxxxxxxxxxxxx> writes:
> We once again have a number of wedged devices:
> ps-maguro-01
> ps-maguro-02
> ps-maguro-03
> ps-mako-01
> ps-mako-03
> This has gone beyond the point where it's just an inconvenience. We
> have had to disable the maguro testing 3 days in a row, due to loss of
> devices.
> We have tried a different host and different hubs, neither made a
> difference. That leaves either the devices themselves, a problem in
> the flash tools or the image itself.
Since all bets are on, I'd go with an issue in the flash tools with a
preference for adb ;)
It may be a bug there or a misuse.
> My current (sleep inspired) thoughts on a course of action:
> - Can we engage the foundations team to take a look? Perhaps they can
> spot something in our infrastructure or actually find an issue. I know
> Sergio has mentioned disabling MTP on the host, but I don't know the
> details.
+1
> - Use an alternate flash? Saviq mentioned using "system-image-cli
> --build 0 --verbose" to do a light flash. Does anyone know if this is
> viable?
Looks like this is discussed already in the other thread ;)
> - Throw more devices at the system?
Yeah, brute force can work to a certain extent and should not be ruled
out.
If I understand correctly, the number of failures has risen *because*
we started flashing more often but the failures to flash are not
systematic.
So while I dislike the approach, that may still be the most pragmatic
one.
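
A rough sketch of why extra devices help when failures are random rather
than systematic. It assumes each flash independently wedges a device with
a fixed probability and that a wedged device stays down until someone
recovers it by hand. The function name and the numbers in the usage
example are hypothetical, chosen only for illustration, not measured from
our lab.

```python
from math import comb


def p_pool_survives(devices, min_needed, flashes_per_device, p_wedge):
    """Probability that at least `min_needed` of `devices` are still usable
    after each one has been flashed `flashes_per_device` times.

    Assumption (not from measurements): every flash wedges the device
    independently with probability `p_wedge`, and a wedged device is lost
    for the rest of the period.
    """
    # A single device survives only if none of its flashes wedges it.
    p_alive = (1.0 - p_wedge) ** flashes_per_device
    # The number of survivors follows a binomial distribution.
    return sum(
        comb(devices, k) * p_alive ** k * (1.0 - p_alive) ** (devices - k)
        for k in range(min_needed, devices + 1)
    )


if __name__ == "__main__":
    # Hypothetical figures: 20 flashes per device, 2% wedge rate per flash.
    for pool_size in (3, 5, 8):
        chance = p_pool_survives(pool_size, 2, 20, 0.02)
        print(f"{pool_size} devices: {chance:.3f} chance of keeping 2 alive")
```

Under this model, flashing more often lowers each device's odds of
surviving, while a bigger pool raises the odds of keeping enough devices
alive. It hides the underlying bug rather than fixing it.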
> Do we have any spare, especially maguros? I don't want to suggest
> swapping devices with those actively being used for fear of
> de-stabilizing those tests.
> - Drop the touch device testing? I don't really want to do this, but
> if we can't get this stabilized, dealing with the issues may
> eventually outweigh the benefits.
I strongly oppose disabling tests because we can't make them pass[1].
Apart from that flashing issue, we are already unable to run tests on
desktop for radeon (see
https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-10-24-autopilot-crashes-gnome-session-on-radeon-7750
and the associated
https://bugs.launchpad.net/ubuntu/+source/glamor-egl/+bug/1244324). The
associated fix (which already took ~1 week) has not landed upstream so
the tests against radeon will probably stay disabled for a week or two.
Then there is
https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-11-04-otto-outage
where we reverted the kernel to a previous known-working version which
means we're potentially hiding failures and I don't even know if a bug
has been filed for that.
If the same logic is applied each time we encounter an issue, we'll end
up not running tests at all :)
But even without going that far, reducing the coverage is a Bad Thing.
New bugs are piling up in the uncovered areas, I have no doubt about
that ;)
> What other approaches have I missed?
Running away?
Kidding, I agree with your summary.
Vincent