canonical-ci-engineering team mailing list archive

Re: On resiliency to failure

To: Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx>
From: Tristram Oaten <tristram.oaten@xxxxxxxxxxxxx>
Date: Tue, 3 Dec 2013 16:17:06 +0000
Cc: Lex Moffitt <nick.moffitt@xxxxxxxxxxxxx>, liam.young@xxxxxxxxxxxxx, Tristram Canonical <tristram.oaten@xxxxxxxxxxxxx>, canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAOe9oG6BUq74-L6NnYHQuts5R1iOucN=qbriXr_JabH=STefTQ@mail.gmail.com>
Sender: tristram@xxxxxxxxxx

Django is based on WSGI; does that imply that WSGI is a preferred
technology? I guess not: you can run Python on Windows, after all. But I
think we should add it. For simple apps it is lightning fast, and with a
little of the Werkzeug library in the mix you have the bare bones of what
you need to support an app like this.
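
As a rough sketch of what "pure WSGI" means here, using only the standard
library: the /status path and the call_app helper are made up for
illustration, and Werkzeug would layer request/response objects and
routing on top of a callable like this.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Bare WSGI callable: look at PATH_INFO and return a bytes body."""
    path = environ.get("PATH_INFO", "/")
    if path == "/status":  # hypothetical endpoint, not from the thread
        status, body = "200 OK", b"ok\n"
    else:
        status, body = "404 Not Found", b"not found\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def call_app(path):
    """Invoke the app in-process, without a server, and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys WSGI requires
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

The same callable can be handed to gunicorn or any other WSGI server
unchanged.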

Perhaps pure WSGI is difficult to find skills for. In that case, Flask
uses Jinja2, whose templating syntax is very close to Django's; you can
bring your choice of ORM and generate an admin with the well-used plugin
flask-admin.

Django is a great CMS. Use it for anything else and you handcuff your
developers.


On 3 December 2013 16:07, Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx> wrote:

> I had a quick chat with Nick Moffitt and Liam Young of webops/GSA,
> then Tristram of the web team, which I think would be useful to all of
> you.
>
> As we are standardising around a model of using Django 1.5 for the
> individual components (as defined in lp:ubuntu-ci-services-itself,
> docs/style.rst), it's worth thinking about the various ways any one of
> these components can fail.
>
> A broader discussion would be of what happens when a component
> completely goes down and cannot be talked to. What does the other end
> do in this circumstance to gracefully handle the failed request and
> prevent a domino effect? We cannot assume that the REST API we're
> talking to will reply, or reply within a given timeout (and we should
> always be setting timeouts). I won't cover this here, but you should
> definitely be thinking about how to handle it.
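
One way to read "always be setting timeouts", as a minimal sketch with
the standard library: fetch_json, its default argument and the opener
parameter are assumptions for illustration, not anything defined in the
thread. The caller gets a fallback value instead of an exception when the
other end is down, slow, or returns something that is not JSON.

```python
import json
import urllib.request


def fetch_json(url, timeout=2.0, default=None, opener=urllib.request.urlopen):
    """Fetch and decode JSON, returning `default` on any failure.

    `opener` exists only so the network call can be swapped out in tests.
    """
    try:
        # Always pass an explicit timeout so a hung peer cannot block us forever.
        with opener(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, URLError and HTTPError;
        # ValueError covers undecodable or malformed JSON.
        return default
```

Whether to return a default, retry, or surface the error is a per-component
decision; the point is only that the failure is handled at the call site.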
>
> So, how can our little Django worker fail? Well, for a start, the node
> it is running on could fall over. That's okay: Django itself is
> horizontally scalable. So we create N WSGI servers (gunicorn) hosting
> the Django code and put them behind HAProxy with a health check set.
>
> With a bit of extra work (we cannot just juju upgrade-charm), this
> would also let us deploy code worker by worker, checking for a bad
> deployment along the way. The online services team is trying to get
> this deployment strategy in place. It's worth talking to bloodearnest
> if you head down that road.
>
> But Django also talks to a Postgres database. How do we handle
> Postgres falling over and leaving Django with nothing to talk to?
> Pgbouncer helps here. If we put pgbouncer in front of a number of
> Postgres instances with a set master instance, we can tolerate some
> failover. Of course, pgbouncer then becomes a SPOF. From talking to
> Nick, it doesn't sound like this has bitten IS often.
>
> It's definitely worth talking to Stuart Bishop (stub) about how best
> to handle Postgres in this SOA. He's our in-house database expert.
>
> Now, replicating postgres like this potentially falls over if we're
> using it to store locks. You've got to wait on pgbouncer to
> synchronise locks across all the postgres nodes.
>
> Also keep in mind whether we really need to store anything in a
> database at all. If you're talking to Launchpad for your information,
> you can probably leave the data there. If you're creating locks, it's
> probably worth rethinking whether you can flip that around: rather
> than going to find a place to put a task, put it on a big queue for
> some workers to grab from.
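
A small in-process sketch of that "flip it around" idea: tasks go onto one
queue and workers pull from it, so no worker needs a lock to claim a task.
Here queue.Queue and threads stand in for whatever message queue and
worker processes would really be used; run_workers and its parameters are
hypothetical names.

```python
import queue
import threading


def run_workers(tasks, handler, worker_count=3):
    """Put every task on a shared queue and let workers drain it."""
    task_queue = queue.Queue()
    result_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    def worker():
        while True:
            try:
                # The queue hands each task to exactly one worker.
                task = task_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            result_queue.put(handler(task))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Collect results; completion order is not guaranteed.
    results = []
    while not result_queue.empty():
        results.append(result_queue.get())
    return results
```

Adding capacity then means starting more workers rather than coordinating
who holds which lock.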
>
> Expanding on that, just how much of Django do you really need? I can't
> imagine we'll need the administrative interface, the templating
> engine, the ORM, or really anything above the routing code in most
> cases. It's probably worth disabling the rest.
>
> Django is pretty heavyweight. Tristram benchmarked it against Flask
> and others and came up with some interesting results:
> https://workflowy.com/shared/1574979c-4603-a345-a145-a6dbb7174885/
>
> Unfortunately, the Preferred Technologies page pretty much forces us
> to use it, but that doesn't mean we cannot strip it down to just what
> we need in each case.
>
> Attached is the diagram Nick and Liam drew for how we might lay out
> each component. Keep in mind this is for a single microservice. We'd
> want this layout for each one. You can ignore the bit at the top for
> squid. We won't need that on the front of most things. Instead, a
> simple Apache in front of HAProxy will suffice.
>
> For good examples of how to do haproxy in prodstack, both psearch (in
> lp:ubuntuone-servers-deploy) and certification (in
> lp:~canonical-losas/canonical-is-charms/certification) were
> recommended.
>
> Thanks!
>
> (Tristram, Liam, and Nick, if I got any of the above wrong, please do
> correct me.)
>
