Re: Autorestarting jenkins slaves
I spent some time on that issue during my Vanguard shift, see
https://app.asana.com/0/8740321118011/8740321118013 for details. I'll
raise some points and ideas below, as this turned out to be more work
than I thought and the various issues and solutions seem worth
discussing.
>>>>> Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx> writes:
> Hi,
> I've discussed that with jamespage and came up with the following
> workaround:
> modified debian/jenkins-slave.upstart
> === modified file 'debian/jenkins-slave.upstart'
> --- debian/jenkins-slave.upstart 2013-02-17 17:11:13 +0000
> +++ debian/jenkins-slave.upstart 2013-12-09 10:29:01 +0000
> @@ -17,3 +17,6 @@
> exec start-stop-daemon --start -c $JENKINS_USER --exec $JAVA --name jenkins-slave \
> -- $JAVA_ARGS -jar $JENKINS_RUN/slave.jar $JENKINS_ARGS
> end script
> +
> +# respawn if the slave crashes
> +respawn
> I've deployed that on jatayu by adding 'respawn' to
> /etc/init/jenkins-slave.conf so daily-release-executor should now
> restart automatically (I've restarted the jenkins-slave service).
> ....
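For the record, the manual workaround on a node boils down to
something like the following sketch (the exact commands may differ,
and sudo may or may not be needed depending on the node):

  # append the stanza to the deployed upstart job
  echo 'respawn' | sudo tee -a /etc/init/jenkins-slave.conf
  # stop/start rather than 'restart' so upstart picks up the new stanza
  sudo stop jenkins-slave
  sudo start jenkins-slave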
>>>>> Francis Ginther <francis.ginther@xxxxxxxxxxxxx> writes:
> Vila,
> My recommendation is to deprecate /usr/local/bin/start-jenkins-slaves
> and rely on individual upstart jobs, one for each slave.
>>>>> Larry Works <larry.works@xxxxxxxxxxxxx> writes:
> I second the motion for upstart jobs for each individual node.
Looks like we have a consensus on not using /usr/local/bin/start-jenkins-slaves.
>>>>> Larry Works <larry.works@xxxxxxxxxxxxx> writes:
> I also wouldn't mind seeing us get away from using SSH to restart
> remote nodes since that will allow us to eliminate another plugin
> (or three).
Can you elaborate on that? By 'using SSH to restart remote nodes', do
you mean us connecting via ssh and restarting the slaves manually?
Probably not, as I fail to see the link with plugins...
>>>>> Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx> writes:
> On 9 December 2013 13:38, Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx> wrote:
>> === modified file 'debian/jenkins-slave.upstart'
>> --- debian/jenkins-slave.upstart 2013-02-17 17:11:13 +0000
>> +++ debian/jenkins-slave.upstart 2013-12-09 10:29:01 +0000
>> @@ -17,3 +17,6 @@
>> exec start-stop-daemon --start -c $JENKINS_USER --exec $JAVA --name jenkins-slave \
>> -- $JAVA_ARGS -jar $JENKINS_RUN/slave.jar $JENKINS_ARGS
>> end script
>> +
>> +# respawn if the slave crashes
>> +respawn
> respawn limit (http://upstart.ubuntu.com/cookbook/#respawn-limit)
> please.
Yup, that was (and still is) on my radar, see
https://app.asana.com/0/8740321118011/9113941145531 .
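Roughly what I have in mind for the final stanza (the limit values are
placeholders, to be discussed, not something I've settled on):

  # respawn if the slave crashes...
  respawn
  # ...but don't retry forever: give up after 5 respawns in 60 seconds
  respawn limit 5 60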
> Otherwise we will poorly handle the case where the slave is broken
> (remember the corrupted jar?) and cannot actually be started.
I vaguely remember it but not the details: what was the symptom, and
how can we automate a check for that?
See https://app.asana.com/0/8740321118011/9113941145533 for a proposal
to check the jar validity, feedback welcome.
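To give an idea of the kind of check I mean (not necessarily what the
proposal above will end up with), something as simple as a zip
integrity test on the jar used by the upstart job may be enough:

  # exits non-zero if slave.jar is truncated or otherwise corrupted
  unzip -tqq "$JENKINS_RUN/slave.jar" \
      || echo "slave.jar looks corrupted, it needs to be re-fetched"

Where exactly that check runs (a pre-start script, a cron job, ...) is
part of what needs discussing.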
Now, I stopped counting at 40 when listing all the nodes where we want
to do that (see https://app.asana.com/0/8740321118011/9113941145537).
40 is too high for a manual fix-and-deploy strategy :-/
And at that point I wonder if we really want to keep using JNLP or if
it's worth choosing a different way to connect to the slaves. Jenkins
offers two other methods:
- launch slave agents on Unix machines by using ssh
- launch slave via execution of command on the Master
My understanding (and practice on http://babune.ladeuil.net:24842) is
that the master can (and will) restart the connection when needed
(including when it's lost), so it may be a better fit[1] than
addressing all the issues we're encountering with JNLP.
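For reference, with the ssh method the master essentially does the
equivalent of (simplified, the real launcher also takes care of
copying slave.jar to the node first):

  ssh jenkins@<node> java -jar <remote-fs-root>/slave.jar

and it re-runs that when the connection drops, so the restart logic
lives on the master instead of in per-node upstart jobs.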
Thoughts?
In a nutshell, I feel that we'd be better served in the short term by
restarting the crashed slaves manually (optionally adding 'respawn'
when we do that) and postponing the better resolution.
Vincent
[1]: That needs to be tested first, of course.