canonical-ci-engineering team mailing list archive
Message #00511
Re: No image #60 - what went wrong?
I created an Asana task for myself to look into the health checks.
On Thu, Dec 12, 2013 at 1:02 PM, Evan Dandrea
<evan.dandrea@xxxxxxxxxxxxx> wrote:
> On 12 December 2013 14:46, Paul Larson <paul.larson@xxxxxxxxxxxxx> wrote:
>> On Thu, Dec 12, 2013 at 4:01 AM, Evan Dandrea
>> <evan.dandrea@xxxxxxxxxxxxx> wrote:
>>> - Siva mentioned that the expected device wasn't appearing in `adb
>>> devices`. Can we have a nagios check for this so we know sooner?
>> As for detection, we should investigate if there's a good way to do
>> this in nagios or in the jobs themselves. I think it sounds feasible
>> but bad device detection may be better integrated into the jobs rather
>> than relying on an external service that doesn't know what state
>> things are expected to be in.
>
> Agreed. Nagios is really just a means of alerting based on some
> condition. The jobs could handle identifying when something has gone
> awry, tell Jenkins to hold the line, and drop a hint to nagios (a file
> in an expected location).
>
>> Another thing that should help is the
>> megajob refactor that Andy has been working on. It would at least
>> deal better with a situation where we lose a device and not require
>> regenerating all the jobs to get things moving again. After this goes
>> in, I'd like to see about adding some sort of a health check that
>> figures out if the device is at least reachable, and marks it
>> bad/offline if not. Before that though, we need all the bits in place
>> to detect the image on it and reflash if not.
>
> Paul, are you happy to take a task for the health check, pending the
> refactor? Can you have it drop a file to hint to nagios that a phone
> is dead (removing that file when things are clear)?
>
> Where do we stand on the megajob refactoring, Andy?
>
>> I think there are some things we could do to improve this (see above)
>> and continue to look for new ways to make it as reliable as the
>> devices will allow us to make it.
>
> Thanks Paul!
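
For illustration, the health check discussed above (notice when the expected device is missing from `adb devices`, drop a hint file for nagios, remove it once things are clear) might look roughly like the sketch below. Nothing here comes from the actual CI jobs: the function names, the `.dead` file suffix and the hint directory are all assumptions, and the parsing only assumes the usual `adb devices` layout of one "serial, whitespace, state" line per device.

```python
import subprocess
from pathlib import Path


def parse_adb_devices(output: str) -> dict[str, str]:
    """Map serial -> state from the text printed by `adb devices`."""
    states = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the "List of devices attached" header,
        # and adb daemon notices (which start with "*").
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def update_hint(serial: str, states: dict[str, str], hint_dir: Path) -> bool:
    """Create or remove the hint file; return True if the device is healthy.

    The "<serial>.dead" naming is hypothetical; any path the nagios
    check is configured to look for would do.
    """
    hint = hint_dir / f"{serial}.dead"
    healthy = states.get(serial) == "device"
    if healthy:
        # Device is back: clear the hint so the alert stops.
        hint.unlink(missing_ok=True)
    else:
        hint_dir.mkdir(parents=True, exist_ok=True)
        hint.write_text(f"state={states.get(serial, 'missing')}\n")
    return healthy


def check_device(serial: str, hint_dir: Path) -> bool:
    """Run `adb devices` and update the hint file for one device."""
    result = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, timeout=30
    )
    return update_hint(serial, parse_adb_devices(result.stdout), hint_dir)
```

A job could call `check_device(serial, Path(...))` before starting a test run and hold off if it returns False, while the nagios side only needs to alert on the presence of a hint file. Keeping the detection in the job matches the point made in the thread that the job, not the external service, knows which device it expects to see.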