On resiliency to failure
I had a quick chat with Nick Moffitt and Liam Young of webops/GSA,
and then with Tristram of the web team, and I think the results would
be useful to all of you.
As we are standardising around a model of using Django 1.5 for the
individual components (as defined in lp:ubuntu-ci-services-itself,
docs/style.rst), it's worth thinking about the various ways any one of
these components can fail.
A broader question is what happens when a component goes down
completely and cannot be talked to. What does the other end do in
that circumstance to handle the failed request gracefully and prevent
a domino effect? We cannot assume that the REST API we're talking to
will reply, or reply within a given timeout (and we should always be
setting timeouts). I won't cover this in depth here, but you should
definitely be thinking about how to handle it.
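For example (a sketch only; the URL and the fallback behaviour are
illustrative), a call with the requests library might look like:

    import requests

    # hypothetical endpoint on another component
    status_url = 'http://localhost:8080/api/status'

    try:
        # never call without a timeout; a hung socket can otherwise
        # block a worker indefinitely
        response = requests.get(status_url, timeout=10)
        response.raise_for_status()
        data = response.json()
    except requests.exceptions.RequestException:
        # covers connection errors, timeouts, and HTTP error statuses;
        # degrade gracefully rather than letting the failure cascade
        data = None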
So, how can our little Django worker fail? Well, for a start, the node
it is running on could fall over. That's okay: Django itself is
horizontally scalable. So we create N WSGI servers (gunicorn) hosting
the Django code and put them behind HAProxy with a health check set.
With a bit of extra work (we cannot just juju upgrade-charm), this
would also let us deploy code worker by worker, checking for a bad
deployment along the way. The online services team is trying to get
this deployment strategy in place. It's worth talking to bloodearnest
if you head down that road.
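For the health check, a tiny endpoint in each worker is enough for
HAProxy to poll. A minimal sketch in the Django 1.5 idiom (the URL
name is our choice, nothing standard):

    # urls.py in the worker
    from django.conf.urls import patterns, url
    from django.http import HttpResponse

    def check(request):
        # return 200 only when the worker can actually serve;
        # add cheap dependency checks here if useful
        return HttpResponse('OK')

    urlpatterns = patterns('',
        url(r'^check$', check),
    )

HAProxy then marks the backend down when /check stops returning 200.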
But Django also talks to a Postgres database. How do we handle
Postgres falling over and leaving Django with nothing to talk to?
Pgbouncer helps here. If we put pgbouncer in front of a number of
postgres instances with a set master instance, we can tolerate some
failover. Of course, pgbouncer then becomes a SPOF. From talking to
Nick, it doesn't sound like this has bitten IS often.
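On the Django side this is mostly invisible: the settings just point
at pgbouncer rather than at postgres directly. A sketch, with
hypothetical names and pgbouncer's usual port:

    # settings.py -- Django only ever sees pgbouncer
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'ci_worker',   # hypothetical database name
            'USER': 'ci_worker',
            'HOST': '127.0.0.1',   # pgbouncer, not postgres itself
            'PORT': '6432',        # pgbouncer's conventional port
        }
    }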
It's definitely worth talking to Stuart Bishop (stub) about how best
to handle postgres in this SOA architecture. He's our in-house
database expert.
Now, replicating postgres like this potentially falls over if we're
using it to store locks: you have to wait for the replication to
synchronise locks across all the postgres nodes.
Also keep in mind whether we really need to store anything in a
database at all. If you're talking to Launchpad for your information,
you can probably leave the data there. If you're creating locks, it's
probably worth flipping that around: rather than finding a place to
put a task, put it on a big queue for some workers to grab from.
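As a sketch of the flipped-around version (assuming a RabbitMQ broker
and the pika client; the queue name and payload are made up):

    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()
    # durable queue so tasks survive a broker restart
    channel.queue_declare(queue='tasks', durable=True)

    # producer: put the task on the queue instead of storing a lock
    channel.basic_publish(exchange='', routing_key='tasks',
                          body='build:1234')

    # worker: grab a task; only ack once it's done, so a crashed
    # worker's task goes back on the queue for someone else
    method, properties, body = channel.basic_get(queue='tasks')
    if method is not None:
        handle_task(body)  # hypothetical handler
        channel.basic_ack(delivery_tag=method.delivery_tag)

The nice property is that there is no lock to synchronise: the broker
hands each task to exactly one worker at a time.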
Expanding on that, just how much of Django do you really need? I can't
imagine we'll need the administrative interface, the templating
engine, the ORM, or really anything above the routing code in most
cases. It's probably worth disabling the rest.
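As a sketch of what stripping it down might look like (the setting
names are Django 1.5's; which bits you keep will vary per component):

    # settings.py -- a bare-bones worker that only routes requests
    DEBUG = False
    ALLOWED_HOSTS = ['*']
    ROOT_URLCONF = 'worker.urls'   # hypothetical urlconf module

    # no admin, no ORM-backed apps, no template machinery
    INSTALLED_APPS = ()
    DATABASES = {}   # fine if the component keeps no state

    # drop the session/auth middleware a pure JSON API doesn't need
    MIDDLEWARE_CLASSES = (
        'django.middleware.common.CommonMiddleware',
    )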
Django is pretty heavyweight. Tristram benchmarked it against Flask
and others and came up with some interesting results:
https://workflowy.com/shared/1574979c-4603-a345-a145-a6dbb7174885/
Unfortunately, the Preferred Technologies page pretty much forces us
to use it, but that doesn't mean we cannot strip it down to just what
we need in each case.
Attached is the diagram Nick and Liam drew for how we might lay out
each component. Keep in mind this is for a single microservice; we'd
want this layout for each one. You can ignore the bit at the top for
squid. We won't need that on the front of most things. Instead, a
simple Apache in front of HAProxy will suffice.
For good examples of how to do HAProxy in prodstack, both psearch (in
lp:ubuntuone-servers-deploy) and certification (in
lp:~canonical-losas/canonical-is-charms/certification) were
recommended.
Thanks!
(Tristram, Liam, and Nick, if I got any of the above wrong, please do
correct me.)