
canonical-ci-engineering team mailing list archive

Re: No image #60 - what went wrong?

 

On 12 December 2013 14:46, Paul Larson <paul.larson@xxxxxxxxxxxxx> wrote:
> On Thu, Dec 12, 2013 at 4:01 AM, Evan Dandrea
> <evan.dandrea@xxxxxxxxxxxxx> wrote:
>> - Siva mentioned that the expected device wasn't appearing in `adb
>> devices`. Can we have a nagios check for this so we know sooner?
> As for detection, we should investigate if there's a good way to do
> this in nagios or in the jobs themselves. I think it sounds feasible
> but bad device detection may be better integrated into the jobs rather
> than relying on an external service that doesn't know what state
> things are expected to be in.

Agreed. Nagios is really just a means of alerting based on some
condition. The jobs could handle identifying when something has gone
awry, tell Jenkins to hold the line, and drop a hint for Nagios (for
example, a file in an agreed location).
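
To make the Nagios side concrete, here is a rough sketch of a check that
only looks for hint files. The `*.dead` naming and the flag directory are
made up for illustration; the exit codes follow the usual Nagios plugin
convention (0 for OK, 2 for CRITICAL).

```python
import sys
from pathlib import Path

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_flags(flag_dir):
    """Return (exit_code, message) based on hint files found in flag_dir.

    Assumption: the jobs write one "<serial>.dead" file per bad device.
    """
    dead = sorted(p.stem for p in Path(flag_dir).glob("*.dead"))
    if dead:
        return CRITICAL, "CRITICAL - devices flagged dead: " + ", ".join(dead)
    return OK, "OK - no devices flagged"


def main(argv):
    # Hypothetical default location; pass the real directory as argv[1].
    flag_dir = argv[1] if len(argv) > 1 else "/var/lib/ci-device-flags"
    code, message = check_flags(flag_dir)
    print(message)
    return code
```

A wrapper script would end with `sys.exit(main(sys.argv))` so that Nagios
sees the exit code. Nagios then never needs to know what state a device is
expected to be in; it only reports what the jobs have already decided.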

> Another thing that should help is the
> megajob refactor that Andy has been working on.  It would at least
> deal better with a situation where we lose a device and not require
> regenerating all the jobs to get things moving again.  After this goes
> in, I'd like to see about adding some sort of a health check that
> figures out if the device is at least reachable, and marks it
> bad/offline if not. Before that though, we need all the bits in place
> to detect the image on it and reflash if not.

Paul, are you happy to take a task for the health check, pending the
refactor? Can you have it drop a file to hint to Nagios that a phone
is dead, and remove that file once the phone is healthy again?
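
Something along these lines is what I have in mind for the job side. It is
only a sketch: the function names, the `<serial>.dead` file name and the
flag directory are placeholders, and it treats any device that
`adb devices` does not list in the `device` state as dead.

```python
import subprocess
from pathlib import Path


def parse_adb_devices(output):
    """Map serial -> state from the text printed by `adb devices`."""
    devices = {}
    for line in output.splitlines():
        parts = line.split()
        # Device lines are "<serial><TAB><state>"; the header, daemon
        # notices and blank lines do not split into exactly two fields.
        if len(parts) == 2:
            devices[parts[0]] = parts[1]
    return devices


def update_flag(serial, adb_output, flag_dir):
    """Write or clear the hint file for one device; return True if healthy."""
    flag = Path(flag_dir) / f"{serial}.dead"  # hypothetical naming scheme
    healthy = parse_adb_devices(adb_output).get(serial) == "device"
    if healthy:
        flag.unlink(missing_ok=True)  # needs Python 3.8+
    else:
        flag.parent.mkdir(parents=True, exist_ok=True)
        flag.write_text("not listed as 'device' by adb devices\n")
    return healthy


def check_device(serial, flag_dir):
    """Run `adb devices` on this host and update the hint file."""
    result = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=False
    )
    return update_flag(serial, result.stdout, flag_dir)
```

A job would call `check_device(serial, flag_dir)` before starting a run and
bail out if it returns False. Keeping the parsing separate from the `adb`
call means it can be tested without a phone attached.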

Where do we stand on the megajob refactoring, Andy?

> I think there are some things we could do to improve this (see above)
> and continue to look for new ways to make it as reliable as the
> devices will allow us to make it.

Thanks Paul!

