
Re: objections to uservice_utils project on pypi?

 

I do enjoy architecture discussions, but if we don't plan for that time in
Scrum, we're going to put the sprint at risk.

That said, I don't want to get the last word in. I only mention this as a
note for the retrospective: we need a time-boxed, planned-for space for
considering changes to our development process. Something like a spike
story with acceptance criteria.

On 2 April 2015 at 01:03, Thomi Richards <thomi.richards@xxxxxxxxxxxxx>
wrote:

> Yes please. I'd love to know how our CD system is supposed to work.
>

Sorry. Clearly Celso and I haven't done a good enough job explaining this
or encouraging more of you to work on it.


>  * How should CD interact with the development cycle? phrased differently:
> as a developer, how do I get told that the thing I just proposed for
> merging is horribly busted and doesn't deploy cleanly?
>

You don't. That's not the job of CD.

It works like this:

1) You propose a branch; Tarmac runs any unit tests, checks that there are
enough reviews, and lands it.
2) CD.py notices that trunk (or the mojo spec trunk) is newer than what's
deployed.
2a) It calls mojo.py to deploy the new revno, in parallel with the existing
deployment.

*Asynchronous* to this process are:
a) Things that check whether a set of workers at a revno are healthy and,
if not, cut them off from Rabbit and the world (circuit breakers; to be
written¹).
b) Things that kill old, unneeded sets of workers at particular revnos (to
be written).
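
To make that flow concrete, here's a rough Python sketch of the sort of loop
I imagine CD.py running. Every specific in it (the branch location, the revno
bookkeeping file, the deploy wrapper script) is made up for illustration; the
real script and the real mojo invocation will differ.

    import subprocess
    import time

    POLL_INTERVAL = 60  # seconds between checks of trunk vs. what is deployed

    def get_trunk_revno(branch):
        # Ask bzr for the latest revno on trunk (or the mojo spec trunk).
        out = subprocess.check_output(['bzr', 'revno', branch])
        return int(out.decode().strip())

    def get_deployed_revno():
        # Hypothetical bookkeeping: the revno of the newest parallel
        # deployment (zero if nothing has been deployed yet).
        try:
            with open('/srv/cd/deployed_revno') as f:
                return int(f.read().strip())
        except IOError:
            return 0

    def deploy(revno):
        # Hypothetical wrapper that calls mojo to stand up the new revno
        # alongside the existing deployment, records it, and returns.
        # CD does not wait to find out whether the new deployment is healthy.
        subprocess.check_call(['./deploy-with-mojo.sh', str(revno)])
        with open('/srv/cd/deployed_revno', 'w') as f:
            f.write(str(revno))

    def main():
        while True:
            trunk = get_trunk_revno('lp:our-project')
            if trunk > get_deployed_revno():
                deploy(trunk)
            time.sleep(POLL_INTERVAL)

    if __name__ == '__main__':
        main()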

The use of the plural is intentional. Everything around CI, CD, and the
monitoring/management of the deployed infrastructure follows the
microservices model. There is not one giant circuit breaker to rule them
all, nor should there be one juju reaper.

This is very different from how we used to do things with deploy.py. There,
we followed a model where we tried to build something that could be proven
to work before we gave it any production data. In this model we assume the
code is woefully busted, with the understanding that only the full production
environment can make that determination with certainty. So we leave it to
production to sort itself out.

¹
https://docs.google.com/a/canonical.com/document/d/1GntWmg05h1W6_WF3zveK0oXCcxwl7RASOURvVKipZ5M/edit

>  * How should CD interact with ops? How should we be notified of some
> runtime issue? How does CD learn about runtime issues?
>

CD doesn't stick around long enough to find out. Whether the error shows up
immediately, 10 minutes from now, or 2 days from now, the effect is the same.
Let's not create a split between post-deployment checks and cron'd checks.

You find out that your deployment failed via the very bad not happy time
signals¹ the services generate to feed the circuit breakers; that same data
could also find its way to Kibana or Pager Duty.

¹
https://docs.google.com/a/canonical.com/document/d/1GntWmg05h1W6_WF3zveK0oXCcxwl7RASOURvVKipZ5M/edit
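
For what it's worth, here's a rough sketch of the kind of per-revno circuit
breaker I have in mind consuming those signals. The signal source, the
threshold, and the cut-off hook are all placeholders; the real transport and
the actual Rabbit surgery are still to be decided.

    import time

    UNHAPPY_THRESHOLD = 3   # consecutive unhappy readings before we trip
    CHECK_INTERVAL = 30     # seconds between looks at the health signals

    def read_health_signals(revno):
        # Placeholder: fetch the latest "not happy" signals published by the
        # workers at this revno, however we end up transporting them.
        return []

    def cut_off_from_rabbit(revno):
        # Placeholder: unbind this set of workers from Rabbit (and the world)
        # so they stop receiving production traffic.
        print('tripped circuit breaker for revno %d' % revno)

    def circuit_breaker(revno):
        unhappy_streak = 0
        while True:
            signals = read_health_signals(revno)
            if any(s.get('status') == 'unhappy' for s in signals):
                unhappy_streak += 1
            else:
                unhappy_streak = 0
            if unhappy_streak >= UNHAPPY_THRESHOLD:
                cut_off_from_rabbit(revno)
                return
            time.sleep(CHECK_INTERVAL)

One of these would run per set of workers at a revno, in keeping with the
microservices model above, and the same signals could be forwarded unchanged
to Kibana or Pager Duty.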

>  * How will CD serialise and handle multiple upgrades? for example: what
> happens if we need to roll out component_A and component_B in an atomic
> operation? What should it do if component_A has several revisions that can
> be deployed - deploy them all at once, or deploy them one at a time?
>

Our architecture is designed so that we shouldn't need or want that. This is
part of my worry about moving to libraries: it introduces coupling that could
require carefully orchestrated landing of code. Your proposal uses versioning
to work around this, so I'm hoping we don't end up there. If we do, I don't
think the solution is an atomic, carefully organised code landing; it's
removing the coupling so that we don't have to care about landing order or
about carefully backing things out.
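
To illustrate what removing the coupling looks like in practice, here's a
hypothetical handler in component_B that accepts both the old and the new
shape of a message from component_A. The message fields are invented; the
point is only that either side can land and deploy first, in any order.

    def handle_request(message):
        # Hypothetical handler in component_B. It accepts both the old
        # payload (a bare package name) and the new payload (a dict that also
        # carries a version), so component_A can start sending the new shape
        # whenever it lands; no coordinated, atomic rollout is needed.
        if isinstance(message, dict):
            package = message['package']
            version = message.get('version')  # new, optional field
        else:
            package = message                 # old-style payload
            version = None
        return process(package, version)

    def process(package, version):
        # Placeholder for component_B's real work.
        return (package, version)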
