canonical-ci-engineering team mailing list archive

Quick chat with Tom Haddon in IS (layered juju-deployer, push-based nagios checks)

To: canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
From: Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx>
Date: Mon, 16 Dec 2013 11:53:08 +0000
Sender: evan@xxxxxxxxxxxxxx

== Layered juju-deployer ==

I ran into a problem over the weekend in trying to build lp:ci-train
(https://docs.google.com/a/canonical.com/presentation/d/1LiDK3nVWUFKPbOCOPQEWdpXuTVpIQNU2SWjTiQaIGxE/edit).
You can tell the Jenkins charm to use non-ephemeral storage, but that
introduces a race condition: if you don't run euca-attach-volume at
the right moment in a juju-deployer run, the install hook errors out
and the entire deploy falls over, partially finished.

I spoke with Tom about this and they handle this problem in
lp:canonical-mojo by having multiple juju-deployer configurations for
the same environment. So the names for each deployed service and
underlying charm line up perfectly, but the configuration variables
and relations differ. This lets you do a deployment in stages:

https://code.launchpad.net/~canonical-sysadmins/canonical-mojo-specs/

Roughly:
1) Just deploy the charms without settings.
2) Add the settings.
3) Add the relations.

They then have a script as part of mojo that handles attaching volumes
and other juju-external tasks.
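
As a rough illustration of the staged approach (not how mojo actually
does it), here is a minimal Python sketch. The config file names, the
"ci-train" deployment name and the attach_volumes hook are all made up
for the example, and it assumes juju-deployer is invoked as
"juju-deployer -c CONFIG DEPLOYMENT". The command runner is passed in
so the sequencing can be exercised without a real environment.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stage layout: (config file, optional juju-external step
# to run once that stage has deployed cleanly).
Stage = Tuple[str, Optional[Callable[[], None]]]


def run_staged_deploy(
    stages: Sequence[Stage],
    deployment: str,
    run: Callable[[List[str]], int],
) -> List[List[str]]:
    """Run one juju-deployer pass per stage, stopping at the first failure."""
    executed = []
    for config, after in stages:
        # Assumed invocation; adjust to however juju-deployer is called locally.
        cmd = ["juju-deployer", "-c", config, deployment]
        executed.append(cmd)
        if run(cmd) != 0:
            raise RuntimeError("stage failed: " + config)
        if after is not None:
            # e.g. attach volumes before the next stage touches the service
            after()
    return executed
```

In real use you would pass subprocess.call as the runner and something
like [("01-charms.yaml", attach_volumes), ("02-settings.yaml", None),
("03-relations.yaml", None)] as the stages, so the volume gets attached
at a known point between passes rather than racing the install hook.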

== Push-based nagios ==

Also, he clarified that the way you do nagios checks is by polling
rather than pushing data. So if you have an error state like "lxc-stop
failed for this container", you want to handle that by dropping a file
to a known location. Your nagios check then reports healthy when the
file doesn't exist and alerts when the file does.
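
A minimal sketch of such a check in Python, assuming the standard
Nagios plugin exit codes (0 for OK, 2 for CRITICAL). The flag file
path is whatever the failing job writes to; nothing here is taken from
an existing check.

```python
import os

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_flag(path):
    """Return (exit code, status line) based on whether the flag file exists."""
    if os.path.exists(path):
        with open(path) as f:
            detail = f.read().strip()
        # Surface whatever the failing job wrote into the file, if anything.
        return CRITICAL, "CRITICAL - " + (detail or "failure flag present")
    return OK, "OK - no failure flag"


def main(argv):
    """Plugin entry point: argv[1] is the flag file to look for."""
    code, message = check_flag(argv[1])
    print(message)
    return code
```

Wire it up with sys.exit(main(sys.argv)) in the plugin script. The job
that hits the error just writes a one-line description to the flag
file, and removes it once the problem has been dealt with.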

This polling means you have a window where failures have occurred but
Nagios and PagerDuty don't know about them yet. On smaller Nagios
deployments we can get away with turning up the frequency of polls, but
there does not appear to be any other way around this.