canonical-ci-engineering team mailing list archive

Re: Autorestarting jenkins slaves

 

I spent some time on that issue during my Vanguard shift, see
https://app.asana.com/0/8740321118011/8740321118013 for details. I'll
raise some points and ideas below, as this is more work than I thought
and the various issues and solutions seem worth discussing.


>>>>> Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx> writes:

    > Hi,
    > I've discussed that with jamespage and came up with the following
    > workaround:

    >    modified      debian/jenkins-slave.upstart
                                                                        

    > === modified file 'debian/jenkins-slave.upstart'
    > --- debian/jenkins-slave.upstart	2013-02-17 17:11:13 +0000
    > +++ debian/jenkins-slave.upstart	2013-12-09 10:29:01 +0000
    > @@ -17,3 +17,6 @@
    >      exec start-stop-daemon --start -c $JENKINS_USER --exec $JAVA --name jenkins-slave \
    >          -- $JAVA_ARGS -jar $JENKINS_RUN/slave.jar $JENKINS_ARGS 
    >  end script
    > +
    > +# respawn if the slave crash
    > +respawn


    > I've deployed that on jatayu by adding 'respawn' to
    > /etc/init/jenkins-slave.conf so daily-release-executor should now
    > restart automatically (I've restarted the jenkins-slave service).

    > ....
>>>>> Francis Ginther <francis.ginther@xxxxxxxxxxxxx> writes:

    > Vila,
    > My recommendation is to deprecate /usr/local/bin/start-jenkins-slaves
    > and rely on individual upstart jobs, one for each slave.

>>>>> Larry Works <larry.works@xxxxxxxxxxxxx> writes:

    > I second the motion for upstart jobs for each individual node.

Looks like we have a consensus on not using /usr/local/bin/start-jenkins-slaves.


>>>>> Larry Works <larry.works@xxxxxxxxxxxxx> writes:
    > I also wouldn't mind seeing us get away from using SSH to restart
    > remote nodes since that will allow us to eliminate another plugin
    > (or three).

Can you elaborate on that? By 'using SSH to restart remote nodes', do
you mean us connecting via ssh and restarting the slaves manually?

Probably not, as I fail to see the link with plugins...


>>>>> Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx> writes:

    > On 9 December 2013 13:38, Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx> wrote:
    >> === modified file 'debian/jenkins-slave.upstart'
    >> --- debian/jenkins-slave.upstart        2013-02-17 17:11:13 +0000
    >> +++ debian/jenkins-slave.upstart        2013-12-09 10:29:01 +0000
    >> @@ -17,3 +17,6 @@
    >> exec start-stop-daemon --start -c $JENKINS_USER --exec $JAVA --name jenkins-slave \
    >> -- $JAVA_ARGS -jar $JENKINS_RUN/slave.jar $JENKINS_ARGS
    >> end script
    >> +
    >> +# respawn if the slave crash
    >> +respawn

    > respawn limit (http://upstart.ubuntu.com/cookbook/#respawn-limit)
    > please.

Yup, that was (and still is) on my radar, see
https://app.asana.com/0/8740321118011/9113941145531.
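
To make the request concrete, here is a minimal sketch of how 'respawn'
and 'respawn limit' could be appended to an existing upstart job in one
pass. The helper name add_respawn and the values 5 and 60 are
hypothetical placeholders, not tuned settings; the only thing taken from
upstart itself is the 'respawn limit COUNT INTERVAL' stanza syntax.

```python
import re


def add_respawn(job_text, count=5, interval=60):
    """Return job_text with 'respawn' and 'respawn limit' stanzas added.

    Hypothetical helper: count and interval are illustrative values only.
    Stanzas that are already present are left untouched.
    """
    lines = job_text.splitlines()
    has_respawn = any(re.match(r"^\s*respawn\s*$", line) for line in lines)
    has_limit = any(re.match(r"^\s*respawn\s+limit\b", line) for line in lines)

    extra = []
    if not has_respawn:
        extra.append("respawn")
    if not has_limit:
        # upstart syntax: respawn limit COUNT INTERVAL
        extra.append("respawn limit %d %d" % (count, interval))
    if not extra:
        return job_text

    body = job_text if job_text.endswith("\n") else job_text + "\n"
    comment = "# respawn if the slave crashes, but give up if it keeps failing"
    return body + "\n" + comment + "\n" + "\n".join(extra) + "\n"
```

Running it twice on the same text changes nothing the second time, so it
should be safe to apply to a node that already carries the workaround.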


    > Otherwise we will poorly handle the case where the slave is broken
    > (remember the corrupted jar?) and cannot actually be started.

I vaguely remember it but not the details: what was the symptom, and
how can we automate a check for it?

See https://app.asana.com/0/8740321118011/9113941145533 for a proposal
to check the jar validity, feedback welcome.
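
As a starting point for discussion, a minimal sketch of such a check,
assuming the corruption shows up at the archive level (truncated or
damaged zip). This is not necessarily what the task above proposes, the
function name jar_looks_valid is hypothetical, and it would not catch a
jar that is a valid archive but the wrong version.

```python
import zipfile
import zlib


def jar_looks_valid(path):
    """Return True if path is a readable zip with intact members and a manifest.

    Hypothetical helper: only detects archive-level damage.
    """
    try:
        with zipfile.ZipFile(path) as jar:
            # testzip() returns the name of the first corrupt member, or None
            if jar.testzip() is not None:
                return False
            return "META-INF/MANIFEST.MF" in jar.namelist()
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
        # Not a zip at all, damaged compressed data, or unreadable file
        return False
```

Such a check could run before the slave is started, so that a broken
slave.jar is reported instead of being respawned over and over.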

Now, I stopped counting at 40 when listing all nodes where we want to do
that (see https://app.asana.com/0/8740321118011/9113941145537).

40 is too high for a manual fix and deploy strategy :-/

And at that point I wonder if we really want to keep using JNLP or if
it's worth choosing a different way to connect to the slaves. Jenkins
offers two other methods:

- launch slave agents on Unix machines by using ssh
- launch slave via execution of command on the Master

My understanding (and practice on http://babune.ladeuil.net:24842) is
that the master can (and will) restart the connection when needed
(including when it's lost), so it may be a better fit[1] than addressing
all the issues we're encountering with JNLP.

Thoughts?

In a nutshell, I feel that we'd be better served in the short term by
restarting the crashed slaves manually, optionally adding 'respawn'
when we do that, and postponing the better resolution.

         Vincent

[1]: That needs to be tested first of course.

