
canonical-ci-engineering team mailing list archive

Re: Autorestarting jenkins slaves

 

>>>>> Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx> writes:

    > On 11 December 2013 11:27, Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx> wrote:
    >> > Otherwise we will poorly handle the case where the slave is broken
    >> > (remember the corrupted jar?) and cannot actually be started.
    >> 
    >> I vaguely remember but no details, what was the symptom, how can we
    >> automate a check for that ?
    >> 
    >> See https://app.asana.com/0/8740321118011/9113941145533 for a proposal
    >> to check the jar validity, feedback welcome.

    > I'm not convinced that it's worth it to explicitly validate the jar.
    > My point was that there are going to be cases where no amount of
    > respawning will bring the slave back to life, and in those cases we
    > should stop after N tries.

Ack, then 'respawn limit 5 3600' will do. We can refine later when we
get better numbers about how long it can take to start a slave when the
master is not there but will come back later, or other esoteric
situations ;)
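
For concreteness, here is roughly what I have in mind for the job file
(paths, the master URL and the node name below are placeholders,
untested):

    # Relevant stanzas of a hypothetical /etc/init/jenkins-slave-NAME.conf
    start on runlevel [2345]
    stop on runlevel [!2345]

    respawn
    # Give up after 5 respawns within 3600 seconds so a slave that can
    # never come up (corrupted slave.jar, etc.) doesn't respawn forever.
    respawn limit 5 3600

    script
        exec java -jar /var/lib/jenkins/slave.jar \
            -jnlpUrl http://MASTER:8080/computer/NODE-NAME/slave-agent.jnlp
    end script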

    >> Now, I stopped counting at 40 when listing all nodes where we want to do
    >> that (see https://app.asana.com/0/8740321118011/9113941145537).
    >> 
    >> 40 is too high for a manual fix and deploy strategy :-/

    > Can you please elaborate?

There are two approaches right now (using jlnp):

- /usr/local/bin/start-jenkins-slaves which supports multiple slaves on
  the same host but does not use upstart,

- creating upstart services for each slave by copying the needed files
  and putting the right slave name where needed (including in the file
  names).

Since we don't want the former, the latter requires creating 40 copies
(I kind of understand why it's done this way for phones, but 40 ??? /me
faints).
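
To be concrete, the manual approach boils down to something like this
for each of the ~40 slaves (the template file and the @SLAVE_NAME@
marker are made up for the example):

    for slave in ci-slave-01 ci-slave-02; do   # ... and ~38 more
        sed "s/@SLAVE_NAME@/${slave}/g" jenkins-slave.conf.in \
            > /etc/init/jenkins-slave-${slave}.conf
    done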

    > I'm not happy about us having that number of nodes not under
    > centralised provisioning, but there are things we can do to
    > mitigate the problem somewhat, like putting a bzr branch of all
    > the code/config that these things could use under /srv (*not* as
    > mounted remote volume) and symlinking to that.

/srv on which server? And pulled when, from the slave nodes? Or do you
mean on all nodes?
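
If I read the proposal correctly, the interim step would be something
like this on every node (the branch URL and paths are invented for the
example):

    bzr branch lp:SOME-CONFIG-BRANCH /srv/ci-config
    ln -s /srv/ci-config/bin/SOME-SCRIPT /usr/local/bin/SOME-SCRIPT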

Once again I'd love to have a package for that, plus some meta-packages
for servers and slaves, which would address the deployment issues, but
that sounds like overkill for this specific issue.

Or are you suggesting we do that without packages, with some branch(es)
as a first/interim step?

None of that is lightweight so far, which is why I stopped and thought
about ssh, which at least puts the burden on the server itself (bar the
ssh key deployment, but that should be a one-time thing). But then
Larry rightly raised the previous issue with that.
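
By "ssh" I mean roughly the following, run from the master (or whatever
central box we pick), so the slave side only needs java, slave.jar and
the key installed; as far as I understand this is what the jenkins ssh
launcher does under the hood (user, host and jar path are
placeholders):

    # The master connects out and runs the agent over the ssh channel.
    ssh jenkins@SLAVE-HOST java -jar /var/lib/jenkins/slave.jar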

     Vincent

