Re: Autorestarting jenkins slaves
On 11 December 2013 11:27, Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx> wrote:
> > Otherwise we will poorly handle the case where the slave is broken
> > (remember the corrupted jar?) and cannot actually be started.
>
> I vaguely remember it but not the details. What was the symptom, and
> how can we automate a check for that?
>
> See https://app.asana.com/0/8740321118011/9113941145533 for a proposal
> to check the jar validity, feedback welcome.
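(For reference, a jar is just a zip archive, so I assume the check being
proposed would amount to something like the sketch below; the path is a
placeholder, not whatever the Asana task actually suggests.)

# Sketch only: a jar is a zip archive, so a cheap validity check before
# (re)starting the slave could be as simple as this. Path is a placeholder.
import zipfile

def jar_looks_valid(path='/srv/jenkins/slave.jar'):
    try:
        with zipfile.ZipFile(path) as jar:
            # testzip() returns the first corrupt member, or None if
            # all the CRCs check out.
            return jar.testzip() is None
    except (IOError, zipfile.BadZipfile):
        return False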
I'm not convinced it's worth explicitly validating the jar.
My point was that there are going to be cases where no amount of
respawning will bring the slave back to life, and in those cases we
should stop after N tries.
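Roughly what I have in mind, purely as a sketch (the command, paths and
numbers are placeholders, not our actual setup):

#!/usr/bin/env python
# Sketch only: respawn the jenkins slave, but give up after MAX_TRIES
# consecutive failures instead of looping forever.
import subprocess
import time

MAX_TRIES = 5          # placeholder for "N"
BACKOFF_SECONDS = 30   # breathe a little between attempts

# Placeholder invocation; the real one depends on how the node
# connects to the master (JNLP, ssh, ...).
SLAVE_CMD = ['java', '-jar', '/srv/jenkins/slave.jar',
             '-jnlpUrl', 'http://master/computer/NODE/slave-agent.jnlp']

def run_slave_with_retries():
    for attempt in range(1, MAX_TRIES + 1):
        returncode = subprocess.call(SLAVE_CMD)
        if returncode == 0:
            # Clean exit (e.g. the master asked us to disconnect): stop.
            return 0
        print('slave exited with %d (attempt %d/%d)'
              % (returncode, attempt, MAX_TRIES))
        time.sleep(BACKOFF_SECONDS)
    # No amount of respawning helped (corrupted jar, etc.): give up and
    # let monitoring/alerting take over.
    return 1

if __name__ == '__main__':
    raise SystemExit(run_slave_with_retries())

The important bit is the cap, not the exact loop: whatever wraps the
slave just needs to stop respawning at some point and let monitoring
take over.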
> Now, I stopped counting at 40 when listing all nodes where we want to do
> that (see https://app.asana.com/0/8740321118011/9113941145537).
>
> 40 is too high for a manual fix and deploy strategy :-/
Can you please elaborate? I'm not happy about us having that number of
nodes outside centralised provisioning either, but there are things we
can do to mitigate the problem somewhat, like putting a bzr branch of
all the code/config these nodes could use under /srv (*not* as a
mounted remote volume) and symlinking to that.
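Something along these lines, again only as a sketch (the branch URL and
paths are made up):

# Sketch only: keep a local bzr branch of the code/config on the node
# itself (not on a mounted remote volume) and point a stable symlink at
# it. The branch URL and paths are placeholders.
import os
import subprocess

BRANCH_URL = 'lp:~canonical-ci-engineering/+junk/slave-config'   # placeholder
LOCAL_BRANCH = '/srv/slave-config'
SYMLINK = '/etc/jenkins-slave/current'

def update_local_branch():
    if os.path.isdir(os.path.join(LOCAL_BRANCH, '.bzr')):
        subprocess.check_call(['bzr', 'pull'], cwd=LOCAL_BRANCH)
    else:
        subprocess.check_call(['bzr', 'branch', BRANCH_URL, LOCAL_BRANCH])

def point_symlink_at_branch():
    # Create the new link under a temporary name, then rename over the
    # old one so readers never see a missing link.
    tmp = SYMLINK + '.new'
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(LOCAL_BRANCH, tmp)
    os.rename(tmp, SYMLINK)

if __name__ == '__main__':
    update_local_branch()
    point_symlink_at_branch()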