canonical-ci-engineering team mailing list archive

Re: On resiliency to failure

To: Tristram Oaten <tristram.oaten@xxxxxxxxxxxxx>
From: Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx>
Date: Tue, 3 Dec 2013 16:28:03 +0000
Cc: Lex Moffitt <nick.moffitt@xxxxxxxxxxxxx>, liam.young@xxxxxxxxxxxxx, canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAEuCub6VicFh-icraMwqpXwcXAZos3hOVxYZzHZ+djWCABVVBQ@mail.gmail.com>
Sender: evan@xxxxxxxxxxxxxx

Tristram,

Could you make a case for this on canonical-tech@xxxxxxxxxxxxxxxxxxx?
In our particular case we need to move quickly, so we don't have a lot
of time here to wait for a decision on whether Flask would be okay to
Mark or to ramp up on it. But knowing whether it's an option for
future components of this system would be super-helpful. We'll be
adding quite a few more in phase two of the project, come February.

Thanks!

On 3 December 2013 16:17, Tristram Oaten <tristram.oaten@xxxxxxxxxxxxx> wrote:
> Django is based on WSGI, does that imply that WSGI is a preferred
> technology? I guess not, you can run Python on Windows, after all. But I
> think we should add it, as for simple apps it is lightning fast, and add a
> little Werkzeug library into the mix and you have yourself the bare bones of
> what you need to support an app like this.
>
> Perhaps pure WSGI is difficult to find skills for, but Flask uses the same
> templating syntax as Django and you can bring your choice of ORM, and
> generate an admin with the well-used plugin flask-admin.
>
> Django is a great CMS. Use it for anything else and you handcuff your
> developers.
>
>
> On 3 December 2013 16:07, Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx> wrote:
>>
>> I had a quick chat with Nick Moffitt and Liam Young of webops/GSA,
>> then Tristram of the web team, which I think would be useful to all of
>> you.
>>
>> As we are standardising around a model of using Django 1.5 for the
>> individual components (as defined in lp:ubuntu-ci-services-itself,
>> docs/style.rst), it's worth thinking about the various ways any one of
>> these components can fail.
>>
>> A broader discussion would be of what happens when a component
>> completely goes down and cannot be talked to. What does the other end
>> do in this circumstance to gracefully handle the failed request and
>> prevent a domino effect? We cannot assume that the REST API we're
>> talking to will reply, or reply within a given timeout (and we should
>> always be setting timeouts). I won't cover this here, but you should
>> definitely be thinking about how to handle it.
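
One way to act on the timeout advice above is sketched here, using only the Python standard library. The function name fetch_json, its parameters, and the retry and backoff values are illustrative assumptions, not anything from lp:ubuntu-ci-services-itself. It bounds every call with a timeout, retries a small number of times, and returns None so the caller can degrade instead of passing the failure along.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=2.0, retries=2, backoff=0.5):
    """Return the decoded JSON body from url, or None if the peer is unavailable.

    Hypothetical helper: the name and default values are assumptions.
    """
    for attempt in range(retries + 1):
        try:
            # The timeout applies to connecting and to each blocking read.
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return json.loads(response.read().decode("utf-8"))
        except (OSError, ValueError):
            # OSError covers refused connections, DNS failures, timeouts and
            # HTTP error statuses (URLError and HTTPError are subclasses).
            # ValueError covers a body that is not valid UTF-8 JSON.
            if attempt < retries:
                # Back off a little longer before each retry.
                time.sleep(backoff * (2 ** attempt))
    # Every attempt failed: report "unavailable" rather than raising.
    return None
```

A caller would check for None and fall back to a cached value, a queued retry, or an explicit "try again later" response, rather than blocking its own clients.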
>>
>> So, how can our little Django worker fail? Well, for a start, the node
>> it is running on could fall over. That's okay: Django itself is
>> horizontally scalable. So we create N WSGI servers (gunicorn) hosting
>> the Django code and put them behind HAProxy with a health check set.
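
For the health check, a minimal sketch of an endpoint a load balancer could poll is below. It is written as a plain WSGI callable so that it stays self-contained; in a Django component the same logic would live in a view. The /health path, the make_health_app name, and the check callable are assumptions for illustration. HAProxy's "option httpchk" directive can be pointed at a path like this one.

```python
def make_health_app(check):
    """Build a WSGI app that answers /health based on check().

    check is any zero-argument callable (hypothetical) that returns a true
    value when the worker's dependencies are reachable.
    """
    def app(environ, start_response):
        if environ.get("PATH_INFO", "") != "/health":
            status, body = "404 Not Found", b"not found\n"
        else:
            try:
                healthy = bool(check())
            except Exception:
                # A crashing check counts as unhealthy, not as a server error.
                healthy = False
            if healthy:
                status, body = "200 OK", b"ok\n"
            else:
                status, body = "503 Service Unavailable", b"unhealthy\n"
        headers = [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ]
        start_response(status, headers)
        return [body]
    return app
```

Returning 503 when the check fails lets the balancer take the worker out of rotation while the process itself stays up for debugging.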
>>
>> With a bit of extra work (we cannot just juju upgrade-charm), this
>> would also let us deploy code worker by worker, checking for a bad
>> deployment along the way. The online services team is trying to get
>> this deployment strategy in place. It's worth talking to bloodearnest
>> if you head down that road.
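
The worker-by-worker idea can be sketched as a small loop. This is not how juju or the online services team does it; rolling_deploy, upgrade, and is_healthy are hypothetical names standing in for whatever tooling performs the upgrade and the health check.

```python
def rolling_deploy(workers, upgrade, is_healthy):
    """Upgrade workers one at a time, stopping at the first bad one.

    upgrade(worker) and is_healthy(worker) are caller-supplied callables
    (hypothetical). Returns (upgraded, failed), where failed is None if
    every worker came back healthy.
    """
    upgraded = []
    for worker in workers:
        upgrade(worker)
        if not is_healthy(worker):
            # Stop here: the remaining workers still run the old code and
            # keep serving traffic behind the load balancer.
            return upgraded, worker
        upgraded.append(worker)
    return upgraded, None
```

Because the loop stops at the first failure, a bad release takes out at most one worker, and the health check on the load balancer keeps traffic away from it.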
>>
>> But Django also talks to a Postgres database. How do we handle
>> Postgres falling over and leaving Django with nothing to talk to?
>> Pgbouncer helps here. If we put pgbouncer in front of a number of
>> postgres instances with a set master instance, we can tolerate some
>> failover. Of course, pgbouncer then becomes a SPOF. From talking to
>> Nick it doesn't sound like this has bitten IS often.
>>
>> It's definitely worth talking to Stuart Bishop (stub) about how to
>> best handle postgres in this SOA. He's our in-house
>> database expert.
>>
>> Now, replicating postgres like this potentially falls over if we're
>> using it to store locks. You've got to wait on pgbouncer to
>> synchronise locks across all the postgres nodes.
>>
>> Also keep in mind whether we really need to store anything in a
>> database at all. If you're talking to Launchpad for your information,
>> you can probably leave the data there. If you're creating locks, it's
>> probably worth rethinking whether you can flip that around: rather
>> than go and find a place to put a task, put it on a big queue for
>> some workers to grab from.
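
To illustrate the queue approach, here is a minimal in-process sketch using the standard library's queue and threading modules. In the real system the queue would presumably be a separate message broker shared by several nodes; run_workers and its parameters are assumptions made for the example.

```python
import queue
import threading


def run_workers(tasks, handler, worker_count=3):
    """Let worker_count threads pull tasks from one shared queue.

    No task is ever locked or assigned up front: each worker simply takes
    the next item, so each task is handled exactly once.
    """
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)
    finished = queue.Queue()

    def worker():
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            finished.put((task, handler(task)))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Drain the results; order depends on thread scheduling.
    results = []
    while not finished.empty():
        results.append(finished.get())
    return results
```

The producer never needs to know which worker is free, and adding capacity is a matter of starting more workers against the same queue.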
>>
>> Expanding on that, just how much of Django do you really need? I can't
>> imagine we'll need the administrative interface, the templating
>> engine, the ORM, or really anything above the routing code in most
>> cases. It's probably worth disabling the rest.
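
As a rough sketch of what disabling the rest might look like, here is a settings fragment using Django 1.5-era setting names. The module name ci_component.urls is hypothetical, and whether a given component can really drop all of these would need checking case by case.

```python
# settings.py fragment for a routing-only component (Django 1.5 names).

# No admin, auth, sessions or contenttypes apps are loaded.
INSTALLED_APPS = ()

# No session, CSRF, auth or messages middleware runs on each request.
MIDDLEWARE_CLASSES = ()

# No template loaders: views build their responses directly.
TEMPLATE_LOADERS = ()

# URL routing is the one piece that stays.
# "ci_component.urls" is a hypothetical module name.
ROOT_URLCONF = "ci_component.urls"
```

Other required settings, such as SECRET_KEY, still have to be supplied as usual.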
>>
>> Django is pretty heavyweight. Tristram benchmarked it against Flask
>> and others and came up with some interesting results:
>> https://workflowy.com/shared/1574979c-4603-a345-a145-a6dbb7174885/
>>
>> Unfortunately, the Preferred Technologies page pretty much forces us
>> to use it, but that doesn't mean we cannot strip it down to just what
>> we need in each case.
>>
>> Attached is the diagram Nick and Liam drew for how we might lay out
>> each component. Keep in mind this is for a single microservice. We'd
>> want this layout for each one. You can ignore the bit at the top for
>> squid. We won't need that on the front of most things. Instead, a
>> simple Apache in front of HAProxy will suffice.
>>
>> For good examples of how to do haproxy in prodstack, both psearch (in
>> lp:ubuntuone-servers-deploy) and certification (in
>> lp:~canonical-losas/canonical-is-charms/certification) were
>> recommended.
>>
>> Thanks!
>>
>> (Tristram, Liam, and Nick, if I got any of the above wrong, please do
>> correct me.)
>
>
