canonical-ci-engineering team mailing list archive
Message #00104
More thoughts on https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-10-09-lxc-dbus-apparmor-otto
I thought about that incident a bit more and there are several points
I'd like to discuss.
AIUI, otto requires an up-to-date graphics driver in the kernel. This
implies running the latest release on the host so the lxc container can
use it.
The container itself uses a snapshot to run the tests to guarantee
isolation.
So far so good.
But the failure we saw in the incident highlights a weakness in the
model: if for some reason the system running on the host is broken by an
update, no more tests can be run.
So first, we need to have a check (or several) that the host provides
the right base for the container.
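To make that concrete, such a check could stay really small; something
along these lines (a python sketch: the module names and minimum kernel
version are placeholders, the real requirements should come from what
otto actually needs):

#!/usr/bin/env python
# Sketch of a host sanity check run before starting the container.
# REQUIRED_MODULES and MIN_KERNEL are hypothetical placeholders.
import platform
import subprocess
import sys

REQUIRED_MODULES = set(['i915', 'nouveau', 'nvidia'])  # placeholder list
MIN_KERNEL = (3, 11)                                    # placeholder minimum

def main():
    out = subprocess.check_output(['lsmod'], universal_newlines=True)
    loaded = set(line.split()[0] for line in out.splitlines()[1:])
    release = platform.release()          # e.g. '3.11.0-12-generic'
    kernel = tuple(int(p) for p in release.split('-')[0].split('.')[:2])
    errors = []
    if not loaded & REQUIRED_MODULES:
        errors.append('no supported graphics module loaded')
    if kernel < MIN_KERNEL:
        errors.append('kernel %s older than required' % release)
    for err in errors:
        print('HOST CHECK FAILED: %s' % err)
    return 1 if errors else 0

if __name__ == '__main__':
    sys.exit(main())

Run on the host before starting the container, a non-zero exit code
would fail the job early instead of producing a misleading autopilot
failure.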
Then we need to decide whether we accept the risk of getting a broken *host*
or whether we need a prerequisite test suite before accepting such upgrades.
A first line of defense could be to have some smoke runs after an
upgrade to ensure the host system is still usable.
An alternative would be to have a dual-boot so we can experiment on one
boot without breaking the production one.
Another alternative would be to run precise (for consistency with the other
physical hosts we manage), or raring, or whatever is stable enough, and
have a kvm in which we run the latest release and build the lxc
only there. I'm not sure we can give access to the graphics card this way
though.
Which brings up another point: do we really need to run all tests against
all graphics drivers (currently intel and nvidia, ati being on hold?), or
can we use them all as a pool to spread the load and consider the tests
valid if they pass on any of them (we'll get some validation on all of
them in the end just by running newer jobs)?
Francis started to reply to that on IRC saying:
<fginther> vila, This should be revisited, but I defer to the unity team (bregma) on wether or not that is still needed.
In any case, I'd like these points to be discussed because this incident
can happen again: we're still fully exposed, and having to work back to
this kind of question from a failing autopilot test takes time and is
not (IMHO) the output expected from a ci system ;)
Ideally, for this case, a single test would have failed, pointing to an
issue in either the lxc or the host. A very coarse test, but such tests
are usually easier to start with and can be refined when new failures appear.
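A sketch of what such a coarse test could look like (python again; the
container name 'otto-raring' and the dbus probe are only illustrative,
not what otto actually sets up):

#!/usr/bin/env python
# Very coarse check: can we talk to dbus on the host, and can the
# container run the same probe ? A failure here points at the host or
# at the lxc setup rather than at an autopilot test.
# 'otto-raring' is a hypothetical container name.
import subprocess
import sys

PROBE = ['dbus-send', '--system', '--print-reply',
         '--dest=org.freedesktop.DBus',
         '/org/freedesktop/DBus', 'org.freedesktop.DBus.ListNames']

def run(cmd):
    return subprocess.call(cmd) == 0

def main():
    if not run(PROBE):
        print('COARSE CHECK: dbus is broken on the *host*')
        return 1
    if not run(['lxc-attach', '-n', 'otto-raring', '--'] + PROBE):
        print('COARSE CHECK: dbus is broken in the *container*')
        return 2
    return 0

if __name__ == '__main__':
    sys.exit(main())

The point is only to fail with a message that says "host" or "container"
before any autopilot test runs.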
I realize this is a bit of a brain dump, but we encountered a simpler (but
similar) issue with sst/selenium/firefox (sst automates website testing
by piloting a browser).
For pay and sso, some sst tests were designed to validate new pay and
sso server releases.
sst is implemented on top of selenium, which itself guarantees
compatibility with firefox.
When firefox releases a new version, selenium often has to catch up and
release a new version too.
There were gaps between ff and selenium releases during which the sst
tests were failing with misleading output.
The solution we came up with was to run a separate job with the
upcoming firefox version so we would get early warning that selenium was
about to break.
This was never fully implemented, but the above describes the principle
well enough (ask for clarifications otherwise ;): updates that are
critical for the ci engine should be staged.
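The early-warning job itself only needs a trivial smoke test run in an
environment where the *upcoming* firefox is already installed (how that
staged firefox gets installed is left out here); a sketch:

#!/usr/bin/env python
# Minimal selenium smoke test for the staging job: if the upcoming
# firefox breaks the installed selenium, this fails before the real
# sst suites start producing misleading results.
# The URL is only an example of a page with a known title.
from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.get('http://www.ubuntu.com/')
    assert 'Ubuntu' in driver.title, driver.title
    print('selenium can still drive the staged firefox')
finally:
    driver.quit()

If that fails, we know selenium needs to catch up before the real sst
suites start reporting misleading output.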
The otto jobs do not stage these updates (and maybe that's not possible
since they also need to validate the graphics driver). But the point
remains: the jobs should not be able to break the ci engine and deliver
misleading results.
And while
https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-10-16
has not been fully diagnosed, it's another case where a job is
directly involved in breaking a part of the ci engine.
What do you think?
Vincent