canonical-ci-engineering team mailing list archive

Re: On resiliency to failure

To: Andy Doan <andy.doan@xxxxxxxxxxxxx>
From: Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx>
Date: Wed, 04 Dec 2013 10:07:03 +0100
Cc: liam.young@xxxxxxxxxxxxx, Lex Moffitt <nick.moffitt@xxxxxxxxxxxxx>, tristram.oaten@xxxxxxxxxxxxx, canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <529E264A.2010200@canonical.com> (Andy Doan's message of "Tue, 03 Dec 2013 12:43:22 -0600")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)

>>>>> Andy Doan <andy.doan@xxxxxxxxxxxxx> writes:

    > On 12/03/2013 10:07 AM, Evan Dandrea wrote:
    >> Attached is the diagram Nick and Liam drew for how we might lay out
    >> each component. Keep in mind this is for a single microservice. We'd
    >> want this layout for each one. You can ignore the bit at the top for
    >> squid. We won't need that on the front of most things. Instead, a
    >> simple Apache in front of HAProxy will suffice.

    >> http://ubuntuone.com/0w12vBEgDVn4YUMh5JkoMq

    > Thinking about this generically, I'm not sure how specifically these issues
    > relate to Django. That is, all solutions at a minimum are going to require some
    > sort of webserver and data-store and/or queue system. The specific solutions
    > to make each implementation highly-available will differ, but they'll all
    > require something.

    > As we start to talk more concretely about high availability, I'm starting to
    > wonder if we should first ask "is it worth it?"

Yes. I think it is worth it, especially at this stage, where neither tests
nor code have been written.

    > That is, what could we expect our availability to be if we just
    > deployed a DB and a couple of web-servers? If the answer is >98%,
    > then is it worth the man-hours required to get us to 99.9%?

And I'd counter that with: how many man-hours will we spend making up
the gap to 100%? As in, every time something fails in the CI engine:

- someone is blocked for X hours,
- said someone pings IS or the Vanguard and waits for Y hours,
- IS/Vanguard spends Z hours diagnosing the issue, devising a fix,
  testing it, and deploying it.

Sure, in many cases you can reduce Y and Z, but still, X is a loss.
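
To put rough numbers on the 98% versus 99.9% question, here is a minimal
sketch that converts an availability figure into hours of downtime per
year. The function name and the choice of a 365-day year are assumptions
made for illustration; they do not come from this thread.

```python
# Hypothetical helper: hours of downtime per year implied by an
# availability figure. Assumes a 365-day year and that availability is
# measured over wall-clock time.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability):
    """Return the yearly downtime in hours for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


for availability in (0.98, 0.999):
    hours = downtime_hours_per_year(availability)
    print("%.1f%% available: about %.1f hours down per year"
          % (availability * 100, hours))
```

Under these assumptions, 98% corresponds to roughly 175 hours of downtime
a year and 99.9% to roughly 9 hours. Each of those hours is a candidate
for the X, Y and Z costs listed above.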

    > from another angle: the ppa-assigner component we have will
    > probably have fewer than 100 operations a day. So are the odds of it
    > being down at the precise time one of those operations is
    > executed already <1%?

Then let's use it as a simple component where we can more easily
experiment with ideas, designs, tests and implementations to get a robust
and reliable architecture in place that we can reuse for other components.
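
For the ppa-assigner figures quoted above, a rough sketch of how many
operations a day would land on downtime. It assumes the upper bound of
100 operations a day and that each operation hits an outage with
probability (1 - availability); the function name is hypothetical, and
real outages are likely to be clustered rather than evenly spread.

```python
# Hypothetical estimate: expected number of operations per day that
# coincide with downtime. Assumes every operation independently of its
# timing has probability (1 - availability) of hitting an outage.
def expected_failed_operations(operations_per_day, availability):
    """Return the expected count of operations per day that hit downtime."""
    return operations_per_day * (1.0 - availability)


for availability in (0.98, 0.999):
    failed = expected_failed_operations(100, availability)
    print("%.1f%% available: about %.1f failed operations per day"
          % (availability * 100, failed))
```

Under these assumptions, 98% availability means about two affected
operations a day, against about one every ten days at 99.9%.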

    > I'm really not wanting to sound lazy here. But this feels like it's
    > snowballing to a place where the work needed to get a few components
    > demo-worthy might be growing too fast.

Right, there is a risk; let's keep it in sight.

    > However, if we don't do this now it might make it too expensive
    > later.

You nailed it ;)

    Vincent

Follow ups

Re: On resiliency to failure
From: Andy Doan, 2013-12-04

References

On resiliency to failure
From: Evan Dandrea, 2013-12-03
Re: On resiliency to failure
From: Andy Doan, 2013-12-03