
Re: No image #60 - what went wrong?

 

On Thu, Dec 12, 2013 at 4:01 AM, Evan Dandrea
<evan.dandrea@xxxxxxxxxxxxx> wrote:
> Image #60 was not successfully produced. This obviously created
> problems for the landing team. Let's take a few minutes to brainstorm
> on what went wrong and come up with some better policy to prevent this
> from happening again.
>
> - Why wasn't an email sent out to the teams affected by the moving of
> phones to a new host at least a day or two in advance of the move? We
> have done this before with scheduled DC maintenance and with the 1SS
> move. We should be doing it every time we make a change that affects
> other teams.
+1, more warning would have been good. We had discussed moving the
phones before, to reduce the SPOF and to minimize the impact of the adb
restarts that were needed a while back, should they be needed again.
The move was done late yesterday afternoon and given some testing, but
it was certainly too short notice; it would have been better to
schedule it with more warning.
>
> - I realise they were unrelated events, but was there any conceivable
> way we could've caught the device failure that followed? That is,
> could we have kicked off some test runs, or aligned the move to the
> image production time? The answer here may be no, but I want to at
> least discuss why.
You're right, they were unrelated. I made sure both mako and maguro
were running and had cleared the first several jobs before going to
sleep last night, but the device went missing during one of the later
jobs. It is currently invisible to adb and we can't reach it remotely,
so we'll need Rick to investigate in person.

> - Siva mentioned that the expected device wasn't appearing in `adb
> devices`. Can we have a nagios check for this so we know sooner?
As for detection, we should investigate whether there's a good way to
do this in nagios or in the jobs themselves. It sounds feasible, but
bad-device detection may be better integrated into the jobs rather
than relying on an external service that doesn't know what state
things are expected to be in.

Another thing that should help is the megajob refactor that Andy has
been working on. It would at least cope better with losing a device,
and wouldn't require regenerating all the jobs to get things moving
again. After that lands, I'd like to add some sort of health check
that figures out whether the device is at least reachable and marks it
bad/offline if not. Before that, though, we need all the bits in place
to detect whether the expected image is on the device and reflash it
if necessary.
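To make the nagios idea a bit more concrete, here's a rough sketch (in
Python) of what a missing-device check could look like, assuming it
just parses `adb devices` output and compares it against the serials
we expect on that host. The serial list is a made-up placeholder and
the exit codes simply follow the usual nagios plugin convention; treat
this as an illustration rather than something we already have:

    #!/usr/bin/env python
    # Sketch of a nagios-style check: compare `adb devices` output against
    # the serials we expect to be attached to this host.
    # Exit codes follow the usual nagios convention: 0=OK, 2=CRITICAL, 3=UNKNOWN.
    import subprocess
    import sys

    # Placeholder serial(s); in practice this would come from per-host config.
    EXPECTED = {"0123456789abcdef"}

    def attached_serials():
        out = subprocess.check_output(["adb", "devices"]).decode()
        serials = set()
        # First line is the "List of devices attached" header; skip it.
        for line in out.splitlines()[1:]:
            parts = line.split()
            # Only count devices adb reports as usable, not "offline"/"unauthorized".
            if len(parts) == 2 and parts[1] == "device":
                serials.add(parts[0])
        return serials

    def main():
        try:
            missing = EXPECTED - attached_serials()
        except (OSError, subprocess.CalledProcessError) as err:
            print("UNKNOWN: could not run adb devices: %s" % err)
            return 3
        if missing:
            print("CRITICAL: missing devices: %s" % ", ".join(sorted(missing)))
            return 2
        print("OK: all expected devices attached")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

The same parsing could just as easily run at the start of each job
instead of (or in addition to) nagios, which would sidestep the
concern about an external service not knowing what state things are
expected to be in.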

> - Is there anything else you think we could've done to better manage
> this? Short of moving to the Airline, are there things you think we
> could be doing to make us more resilient to this kind of failure?
When I noticed late last night that 60 didn't work, 61 had *just*
started. I got things moving again so that 61 would definitely work,
and thought about sending a mass email about it; I probably should
have gone with that first instinct. I figured that Didier would very
likely go looking at the results and poke psivaa about them first,
though, so I wanted to make sure at least he knew what was going on
and could either say that 61 was more important anyway, or that we
should push a way to rerun 60 in the morning if we need the results
from that one too.

I think there are some things we could do to improve this (see above),
and we should continue to look for new ways to make it as reliable as
the devices will allow.

