Re: More thoughts on https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-10-09-lxc-dbus-apparmor-otto
Vincent,
I think you raised the essential point: we're performing an update to
the infrastructure without knowing whether it is going to work and
without a way to get back to the last working production environment.
We should be able to define a basic test that is executed after an
update to give us at least minimal confidence that things will keep
working. This could start out as a small collection of autopilot test
cases from the archive (so we know they have already been through CI
at least once and passed). For maintaining a production environment,
dual booting should work as long as we mark the partitions as
'production', 'testing', 'garbage', etc.
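
As a concrete starting point, the post-update smoke check could be as
simple as the sketch below (the suite names are placeholders, and it
assumes the autopilot CLI is installed on the host):

#!/usr/bin/env python
# Minimal post-update smoke check: run a handful of known-good autopilot
# suites and fail loudly if any of them regress.  The suite list is a
# placeholder; the real one would be a small selection we already saw
# pass through CI.
import subprocess
import sys

SMOKE_SUITES = [
    "unity8.shell",            # placeholder suite ids
    "ubuntu_system_settings",
]

def run_suite(suite):
    # Assumes 'autopilot run <suite>' is available on this machine.
    return subprocess.call(["autopilot", "run", suite]) == 0

failures = [s for s in SMOKE_SUITES if not run_suite(s)]
if failures:
    print("smoke check failed: %s" % ", ".join(failures))
    sys.exit(1)
print("smoke check passed (%d suites)" % len(SMOKE_SUITES))
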
As far as running all tests across multiple chipsets goes, I really
only see a need for this with unity7. And there the issue is as much
about identifying regressions in the drivers as in unity7 itself. As a
consequence, I would prefer to continue running those tests on all
chipsets (including the currently missing amd/ati).
This is not isolated to the daily release otto systems. Soon we'll
need to upgrade most of the upstream merger systems from raring to
saucy. And just like with the move, the only way to have any
confidence after the update is to run a representative sample of jobs.
The 'good' thing is that this is only an occasional upgrade and in
most cases, we can stagger the updates across redundant systems.
Francis
On Wed, Oct 16, 2013 at 12:28 PM, Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx> wrote:
> I thought about that incident a bit more and there are several points
> I'd like to discuss.
>
> AIUI, otto requires an up-to-date graphics driver in the kernel. This
> implies running the latest release on the host so the lxc container can
> use it.
>
> The container itself uses a snapshot to run the tests, to guarantee
> isolation.
>
> So far so good.
>
> But the failure we saw in the incident highlights the weakness in the
> model: if for some reason the system running on the host is broken by an
> update, no more tests can be run.
>
> So first, we need to have a check (or several) that the host provides
> the right base for the container.
>
> Then we need to decide whether we accept the risk of getting a broken
> *host* or whether we need a prerequisite test suite before accepting
> such upgrades.
>
> A first line of defense could be to have some smoke runs after an
> upgrade to ensure the host system is still usable.
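>
> A rough sketch of such a host check (a python sketch; the paths and
> commands below are guesses that would need to be adjusted to the real
> otto setup):
>
> import os
> import subprocess
> import sys
>
> def cmd_ok(cmd):
>     # True if the command exists and exits 0.
>     try:
>         return subprocess.call(cmd) == 0
>     except OSError:
>         return False
>
> checks = {
>     # lxc tooling still works after the upgrade
>     "lxc": cmd_ok(["lxc-ls"]),
>     # the system dbus is up (the 2013-10-09 incident involved dbus/apparmor)
>     "dbus": os.path.exists("/var/run/dbus/system_bus_socket"),
>     # apparmor is loaded
>     "apparmor": os.path.exists("/sys/kernel/security/apparmor"),
>     # a drm device is exposed, i.e. the graphics driver came up
>     "graphics": os.path.exists("/dev/dri/card0"),
> }
>
> failed = [name for name, ok in checks.items() if not ok]
> if failed:
>     print("host check failed: %s" % ", ".join(failed))
>     sys.exit(1)
> print("host check passed")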
>
> An alternative would be to have a dual-boot so we can experiment on one
> boot without breaking the production one.
>
> Another alternative would be to run precise (or raring, or whatever is
> stable enough) on the host, for consistency with the other physical
> hosts we manage, and have a kvm in which we run the latest release and
> build the lxc only there. I'm not sure we can give access to the
> graphics card this way though.
>
> Which brings up another point: do we really need to run all tests
> against all graphics drivers (currently intel and nvidia, with ati on
> hold?)? Or can we use them all as a pool to spread the load and
> consider the tests valid if they pass on any of them (we would still
> get some validation on all of them in the end just by running newer
> jobs)?
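>
> To illustrate the pool idea (the node names are made up), jobs could
> simply be spread round-robin across whatever chipsets we have, so every
> driver still sees a share of the load over time:
>
> import itertools
>
> NODES = ["otto-intel-01", "otto-nvidia-01", "otto-nvidia-02"]
>
> def assign_jobs(jobs, nodes=NODES):
>     # Round-robin the jobs across the pool instead of running every job
>     # on every chipset.
>     pool = itertools.cycle(nodes)
>     return [(job, next(pool)) for job in jobs]
>
> print(assign_jobs(["unity7-ci-101", "unity7-ci-102", "unity7-ci-103"]))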
>
> Francis started to reply to that on IRC saying:
>
> <fginther> vila, This should be revisited, but I defer to the unity team (bregma) on wether or not that is still needed.
>
> In any case, I'd like these points to be discussed: this incident can
> happen again, we're still fully exposed, and having to work back to
> this kind of question from a failing autopilot test takes time and is
> not (IMHO) the output expected from a ci system ;)
>
> Ideally, for this case a single test would have failed, pointing to an
> issue in either the lxc or the host. A very coarse test, but coarse
> tests are usually easier to start with and can be refined when new
> failures appear.
>
> I realize this is a bit of a brain dump, but we encountered a simpler
> (but similar) issue with sst/selenium/firefox (sst automates website
> testing by piloting a browser).
>
> For pay and sso, some sst tests were designed to validate new pay and
> sso server releases.
>
> sst is implemented on top of selenium, which itself guarantees
> compatibility with firefox.
>
> When firefox releases a new version, selenium often has to catch up and
> release a new version too.
>
> There were gaps between ff and selenium releases during which the sst
> tests were failing: wrong output.
>
> The solution we came up with was to run a separate job with the
> upcoming firefox version, so we got early warnings that selenium would
> break.
>
> This was never fully implemented, but the above describes the
> principle well enough (ask for clarifications otherwise ;): updates
> that are critical for the ci engine should be staged.
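>
> As a sketch of what such a staged run could look like (the sst
> invocation below is a placeholder, and it assumes the upcoming firefox
> is already installed on the node running this job), the early-warning
> job runs the same suite but only warns, never gates:
>
> import subprocess
> import sys
>
> # Placeholder invocation of the sst suite; the real job would use
> # whatever command drives the pay/sso acceptance tests.
> SST_SUITE = ["sst-run", "acceptance"]
>
> rc = subprocess.call(SST_SUITE)
> if rc != 0:
>     # A failure here means "selenium will probably break when this
>     # firefox lands", not "block anything now".
>     print("WARNING: sst suite failed against the upcoming firefox")
> sys.exit(0)  # this job only warns, it never fails the pipeline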
>
> The otto jobs do not stage these updates (and maybe that's not
> possible since they also need to validate the graphics driver). But the
> point remains: the jobs should not be able to break the ci engine and
> deliver misleading results.
>
> And while
> https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-10-16
> has not been fully diagnosed, it's another case where a job is directly
> involved in breaking a part of the ci engine.
>
> What do you think ?
>
> Vincent
>
>
>
--
Francis Ginther
Canonical - Ubuntu Engineering - Quality Engineer