
canonical-ci-engineering team mailing list archive

Re: Autorestarting jenkins slaves

 

On 12/11/2013 06:27 AM, Vincent Ladeuil wrote:
> I spent some time on that issue during my Vanguard shift, see
> https://app.asana.com/0/8740321118011/8740321118013 for details, I'll
> raise some points and ideas below as this is more work than I thought
> and seems worth discussing the various issues and solutions.
>
>
>>>>>> Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx> writes:
>     > Hi,
>     > I've discussed that with jamespage and came up with the following
>     > workaround:
>
>     >    modified      debian/jenkins-slave.upstart
>
>
>     > === modified file 'debian/jenkins-slave.upstart'
>     > --- debian/jenkins-slave.upstart	2013-02-17 17:11:13 +0000
>     > +++ debian/jenkins-slave.upstart	2013-12-09 10:29:01 +0000
>     > @@ -17,3 +17,6 @@
>     >      exec start-stop-daemon --start -c $JENKINS_USER --exec $JAVA --name jenkins-slave \
>     >          -- $JAVA_ARGS -jar $JENKINS_RUN/slave.jar $JENKINS_ARGS 
>     >  end script
>     > +
>     > +# respawn if the slave crashes
>     > +respawn
>
>
>     > I've deployed that on jatayu by adding 'respawn' to
>     > /etc/init/jenkins-slave.conf so daily-release-executor should now
>     > restart automatically (I've restarted the jenkins-slave service).
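
Worth noting for anyone repeating this: upstart's 'restart' command does
not re-read a changed job file, so picking up an edit to
/etc/init/jenkins-slave.conf needs an explicit stop and start:

    sudo stop jenkins-slave
    sudo start jenkins-slave    # the new 'respawn' stanza takes effect here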
>
>     > ....
>>>>>> Francis Ginther <francis.ginther@xxxxxxxxxxxxx> writes:
>     > Vila,
>     > My recommendation is to deprecate /usr/local/bin/start-jenkins-slaves
>     > and rely on individual upstart jobs, one for each slave.
>
>>>>>> Larry Works <larry.works@xxxxxxxxxxxxx> writes:
>     > I second the motion for upstart jobs for each individual node.
>
> Looks like we have a consensus on not using /usr/local/bin/start-jenkins-slaves.
>
>
>>>>>> Larry Works <larry.works@xxxxxxxxxxxxx> writes:
>     > I also wouldn't mind seeing us get away from using SSH to restart
>     > remote nodes since that will allow us to eliminate another plugin
>     > (or three).
>
> Can you elaborate on that? By 'using SSH to restart remote nodes' do you
> mean us connecting via ssh and restarting the slaves manually?
>
> Probably not, as I fail to see the link with plugins...
>
Some of the jenkins slave nodes (mostly, but not strictly limited to, VMs)
are started from the jenkins master via the ssh-slaves plugin. Installing
the jenkins-slave package on ALL nodes and starting the slave from the
node itself, instead of via the ssh-slaves plugin from the master, would
eliminate the need for the ssh-slaves plugin as well as the credentials
and ssh-credentials plugins. We can still use the libvirt-slaves plugin to
launch the VMs as needed and shut them down when they are not (it also
reverts the VMs to a saved snapshot state and helps lessen the load on the
VM hosting server).
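
For reference, a node started that way mostly needs a defaults file
pointing it at the master. A minimal sketch, assuming variable names that
mirror the quoted upstart job (the master URL and node name below are
placeholders, not our real hosts):

    # /etc/default/jenkins-slave -- sketch only; the exact variable names
    # depend on the packaged job, these mirror the quoted upstart script.
    JENKINS_USER=jenkins
    JENKINS_RUN=/var/run/jenkins
    JAVA=/usr/bin/java
    JAVA_ARGS="-Xmx256m"
    # JNLP connection back to the master:
    JENKINS_ARGS="-jnlpUrl http://jenkins-master:8080/computer/NODE_NAME/slave-agent.jnlp"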
>>>>>> Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx> writes:
>     > On 9 December 2013 13:38, Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx> wrote:
>     >> === modified file 'debian/jenkins-slave.upstart'
>     >> --- debian/jenkins-slave.upstart        2013-02-17 17:11:13 +0000
>     >> +++ debian/jenkins-slave.upstart        2013-12-09 10:29:01 +0000
>     >> @@ -17,3 +17,6 @@
>     >> exec start-stop-daemon --start -c $JENKINS_USER --exec $JAVA --name jenkins-slave \
>     >> -- $JAVA_ARGS -jar $JENKINS_RUN/slave.jar $JENKINS_ARGS
>     >> end script
>     >> +
>     >> +# respawn if the slave crashes
>     >> +respawn
>
>     > respawn limit (http://upstart.ubuntu.com/cookbook/#respawn-limit)
>     > please.
>
> Yup, that was (and still is) on my radar, see
> https://app.asana.com/0/8740321118011/9113941145531 .
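
For what it's worth, the combined stanzas would look something like this
(a sketch; the 5/60 values are arbitrary examples, and per the cookbook a
job that respawns more than COUNT times within INTERVAL seconds is left
stopped rather than respawned forever):

    # respawn if the slave crashes, but give up if it dies more than
    # 5 times in 60 seconds
    respawn
    respawn limit 5 60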
>
>
>     > Otherwise we will poorly handle the case where the slave is broken
>     > (remember the corrupted jar?) and cannot actually be started.
>
> I vaguely remember, but not the details. What was the symptom, and how
> can we automate a check for it?
>
> See https://app.asana.com/0/8740321118011/9113941145533 for a proposal
> to check the jar validity, feedback welcome.
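
As a rough illustration of what such a check could look like (a sketch
only, not the actual proposal; it assumes $JENKINS_RUN is available to the
pre-start script, e.g. via an env stanza or by sourcing the defaults file,
and relies on 'unzip -t' exiting non-zero for a corrupted jar):

    pre-start script
        # refuse to start if slave.jar fails a basic integrity check;
        # 'stop; exit 0' is the upstart idiom for aborting a start cleanly
        if ! unzip -t $JENKINS_RUN/slave.jar > /dev/null 2>&1; then
            logger -t jenkins-slave "slave.jar is corrupted, not starting"
            stop; exit 0
        fi
    end script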
>
> Now, I stopped counting at 40 when listing all nodes where we want to do
> that (see https://app.asana.com/0/8740321118011/9113941145537).
>
> 40 is too high for a manual fix-and-deploy strategy :-/
>
> And at that point I wonder if we really want to keep using JNLP or if
> it's worth choosing a different way to connect to the slaves. Jenkins
> proposes two other methods:
>
> - launch slave agents on Unix machines by using ssh
> - launch slave via execution of command on the Master
I have not tried the latter of the two methods listed above, but the
first runs counter to my comments about using ssh to start slave nodes:
it requires the use of three plugins (which, I believe, we are trying to
limit the need for as much as possible). We have also had issues in the
recent past where slave nodes started via the ssh-slaves plugin could not
reliably post their artifacts back to the jenkins master.
>
> My understanding (and practice on http://babune.ladeuil.net:24842) is
> that the master can (and will) restart the connection when needed
> (including when it's lost), so it may be a better fit[1] than addressing
> all the issues we're encountering with JNLP.
>
> Thoughts ?
>
> In a nutshell, I feel that we'd be better served in the short term by
> restarting the crashed slaves manually, with the option of adding
> 'respawn' when we do, and postponing the better resolution.
>
>          Vincent
>
> [1]: That needs to be tested first of course.
>


