canonical-ci-engineering team mailing list archive

Re: Autorestarting jenkins slaves

 

    >> 
    >> Can you elaborate on that? By 'using SSH to restart remote nodes' you
    >> mean us connecting via ssh and restarting the slaves manually?
    >> 
    >> Probably not as I fail to see the link with plugins...
    >> 

    > Some of the jenkins slave nodes (mostly but not strictly limited to VMs)
    > are started from the jenkins master via the use of the ssh-slaves
    > plugin. Installing the jenkins-slave package on ALL nodes and starting
    > them from the node instead of via the ssh-slaves plugin from the master
    > would eliminate the need/use of the ssh-slaves plugin as well as the
    > credentials and ssh-credentials plugins.

Right, those are indeed the two conflicting approaches ;) I've used the
'credentials' plugin and it's enough to register a username and an ssh
key as a path (so the private key doesn't have to be stored inside
jenkins, which I wouldn't be comfortable with ;). I didn't try to
*avoid* even that and rely only on ~jenkins/.ssh/config, which I would
prefer.
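
To make that concrete, here is roughly what I have in mind for the
config-only approach (the host pattern and key path are invented, the
real values would come from our lab inventory):

    # ~jenkins/.ssh/config on the master -- sketch only
    Host acer-veriton-* daily-release-executor
        User jenkins
        IdentityFile ~/.ssh/ci-lab-slaves
        StrictHostKeyChecking yes

i.e. everything the master needs to reach a slave lives in one file we
can put under version control, with nothing in the jenkins credentials
store.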

    > We can still use the libvirt-slaves plugin to launch the VMs as
    > needed and shut them down when not (as it also reverts the VMs to
    > a saved snapshot state and helps lessen the system load on the VM
    > hosting server).

Right.

<snip/>

    >> And at that point I wonder if we really want to keep using JNLP or if
    >> it's worth choosing a different way to connect to the slaves. jenkins
    >> proposes two other methods:
    >> 
    >> - launch slave agents on Unix machines by using ssh
    >> - launch slave via execution of command on the Master

    > I have not tried the latter of the two methods listed above but the
    > first of the two is counter to my comments about using ssh to start
    > slave nodes.

Indeed but I couldn't remember the details, so thanks.
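
For the record, the second method ('execution of a command on the
Master') boils down, as far as I understand it, to giving jenkins a
command whose stdin/stdout become the slave channel, typically
something along the lines of (node name and jar path are guesses):

    ssh jenkins@acer-veriton-01 java -jar /var/lib/jenkins/slave.jar

so it's essentially the ssh-slaves plugin done by hand, with the key
handling left to ~jenkins/.ssh/config.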

    > This requires the use of three plugins (which, I believe, we are
    > trying to limit the need for as much as possible).

I'd say two seem to be enough (what does the ssh-credentials one do?)
but yeah.

My campaign against jenkins plugins is based on the issues we encounter
with plugin bugs, updating them and maintaining a coherent set on all
our servers.

So, *by default* I'd like to use fewer of them. But mainly fewer buggy
or badly supported ones ;)

My background with the libvirt, ssh and credentials ones is that they
allowed me to replace a hand-crafted solution I had set up to avoid
issues with JNLP ;) So obviously our experiences differ, but just as
obviously yours is more pertinent since it comes from our ci lab ;)

So thanks for bringing the "fewer plugins please" idea back, but let's
continue the discussion ;) I'm happy to contradict myself if we end up
with a more robust solution!

I also consider that if/when we have enough automated tests around these
plugins to stage any update and can confidently and automatically deploy
new versions, the policy can be reconsidered.

    > We have also had issues in the recent past with slave nodes
    > started via the ssh-slaves plugin being able to post their
    > artifacts back to the jenkins master.

*That* is the one I couldn't remember! I now remember I ended my
campaign for ssh connections at the time because of that ;)

But I stopped arguing then because I thought restarting the slaves
wasn't such a big deal and wouldn't require as much work as I now
realize it does.

/me sighs

Damned if you do, damned if you don't.

So I see three steps here:

- set up tests for the plugin features we care about so we can test
  new features/uses properly,

- investigate the different plugins we want to keep and the ones we want
  to get rid of, preferably with tests, at least for the ones we want to
  keep,

- decide if JNLP is better than ssh.

For ssh the key points are:

- management is done on the server only,

- a single ssh key should be enough for our needs but will need to be
  deployed on all slaves when it changes (a rare event); that will also
  be the only thing to deploy (see the sketch after this list),

- needs a nagios alert (no idea how we can do that for now),

- may be invalid if the artifact upload issue can't be fixed[2].
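
For the key deployment mentioned above, I'm thinking of nothing fancier
than this (node list and key name are invented, the real list would
come from whatever inventory we keep):

    # run from the master as the jenkins user -- sketch only
    for node in daily-release-executor acer-veriton-01 acer-veriton-02; do
        ssh-copy-id -i ~/.ssh/ci-lab-slaves.pub jenkins@$node
    done

which is also trivial to keep under version control next to the node
list.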


For JNLP:

- needs a custom upstart job[1] and 2 nagios alerts (service down,
  slave.jar corrupted); a possible check for the first is sketched
  after this list,

- needs to be deployed on all nodes (>40), so needs to be automated and
  under version control,

- still requires some configuration on the jenkins server (light),

- is known to work.
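
For the 'service down' alert I suspect plain check_procs run via nrpe
on each node would be enough, something like (untested):

    check_procs -c 1: -C java -a slave.jar

The 'slave.jar corrupted' one is less obvious and probably needs a
small custom check.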

It's unclear which one requires more work at that point, especially with
some unknowns on both sides ;)

     Vincent

[1]: The upstream upstart job doesn't scale when there are multiple
     slaves on the same host, as is the case for
     daily-release-executor, the acer-veriton nodes and the phones on
     kinnara (I'm probably missing some, feedback welcome). So we either
     want a way to automate the duplication we've been doing to handle
     these specific cases (/etc/default/jenkins-slave,
     /etc/init/jenkins-slave.conf, did I forget one?) or to enhance
     /usr/local/bin/start-jenkins-slaves (but we don't like that
     approach very much).
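
     One option (untested, and the variable names below are guesses,
     not what the packaged job actually uses) would be to turn it into
     an upstart instance job so a single file covers N slaves per host:

         # /etc/init/jenkins-slave.conf -- sketch only
         description "Jenkins JNLP slave (one instance per node)"
         instance $NODE
         respawn
         setuid jenkins
         script
             # per-node settings: MASTER_URL, NODE_SECRET, SLAVE_HOME
             . /etc/default/jenkins-slave-$NODE
             exec java -jar $SLAVE_HOME/slave.jar \
                 -jnlpUrl $MASTER_URL/computer/$NODE/slave-agent.jnlp \
                 -secret $NODE_SECRET
         end script

     started with e.g. 'start jenkins-slave NODE=acer-veriton-01', so
     adding a slave on a host is one file in /etc/default and nothing
     else.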

[2]: But do we know for sure that it's triggered by using an ssh
     connection? I have trouble imagining why...

