canonical-ci-engineering team mailing list archive

Re: On resiliency to failure

To: Andy Doan <andy.doan@xxxxxxxxxxxxx>
From: Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx>
Date: Mon, 09 Dec 2013 09:50:10 +0100
Cc: liam.young@xxxxxxxxxxxxx, Lex Moffitt <nick.moffitt@xxxxxxxxxxxxx>, tristram.oaten@xxxxxxxxxxxxx, canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <529F4C82.6020509@canonical.com> (Andy Doan's message of "Wed, 04 Dec 2013 09:38:42 -0600")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)

>>>>> Andy Doan <andy.doan@xxxxxxxxxxxxx> writes:

    > i'm purely playing devil's advocate.

Challenge accepted ;)

    > On 12/04/2013 03:07 AM, Vincent Ladeuil wrote:
    >> > As we start to talk more concretely about high availability, I'm starting to
    >> > wonder if we should first ask "is it worth it?"
    >> 
    >> Yes, I think it is worth it, especially at this stage, where neither
    >> tests nor code have been written.

    > Correct, but we have to make sure we don't get so bogged down in
    > trying to make deployment code that we never actually create the
    > features people need.

Yup, so let's focus on simple things first: they are the easiest to
design for reliability.

    >> > i.e. what could we expect our availability to be if we just
    >> > deployed a DB and a couple of web-servers? If the answer is >98%,
    >> > then is it worth the man-hours required to get us to 99.9%?
    >> 
    >> And I'd counter that with: how many man-hours will we spend filling
    >> the gap up to 100%? As in: every time something fails in the ci engine:
    >> 
    >> - someone is blocked for X hours,
    >> - said someone pings IS or the Vanguard and waits for Y hours,
    >> - IS/Vanguard spends Z hours diagnosing the issue, devising a fix,
    >> testing it, deploying it.
    >> 
    >> Sure, in many cases you can reduce Y and Z, but still, X is a loss.

    > The question is what percent of X is because of failures that HA
    > would prevent?  I don't think we have this quantified. In my mind
    > it's probably less than 50%. From your comments, I'm guessing you
    > feel differently.

Yeah, I want failure modes that require human intervention to
disappear, i.e. driving X to zero.

Or if not zero, at least handling 90% of the cases automatically so we
can focus on the remaining 10%.
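
To put rough numbers on the availability figures quoted above, here is
a minimal sketch of the arithmetic. It assumes a 365-day year of
continuous service; the function names are made up for illustration,
and the incident count in the usage note is hypothetical, not a
measurement.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Hours of unavailability implied by an availability fraction (0..1)."""
    return (1.0 - availability) * period_hours


def manual_incidents(total_incidents, automated_fraction):
    """Incidents still needing a human once a fraction recovers automatically."""
    return total_incidents * (1.0 - automated_fraction)


if __name__ == "__main__":
    for availability in (0.98, 0.999):
        hours = downtime_hours(availability)
        print(f"{availability:.1%} available: about {hours:.1f} hours down per year")
```

Under those assumptions, 98% leaves roughly 175 hours of downtime per
year while 99.9% leaves under 9. Likewise, with a hypothetical 40
incidents a year, handling 90% of them automatically would leave about
4 for a human to look at.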

We have already faced (and are still facing) similar issues, see:

- jenkins slaves need to restart automatically
  https://app.asana.com/0/8740321118011/8740321118013

- stopping otto containers left running
  https://app.asana.com/0/search/9060124218653/8792066028175

I.e. we need to design for failure.

It's ok to have component failures as long as we recover and the whole
engine keeps running.
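
As an illustration of recovering from a component failure without
human intervention, here is a minimal sketch of a supervisor loop that
restarts a failing component with exponential backoff. It is not how
the jenkins slaves or otto containers are actually managed; the
`supervise` name, its parameters, and the retry limits are all
assumptions made for the example.

```python
import time


def supervise(run, max_restarts=5, base_delay=1.0, max_delay=60.0,
              sleep=time.sleep):
    """Call run() until it returns 0, restarting it after each failure.

    run is any callable returning an exit code (0 means a clean exit).
    The wait between restarts doubles each time, capped at max_delay.
    Returns the number of restarts performed; raises RuntimeError once
    max_restarts is exhausted, so a human only hears about the cases
    that could not be recovered automatically.
    """
    restarts = 0
    delay = base_delay
    while True:
        returncode = run()
        if returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"giving up after {restarts} restarts "
                f"(last exit code {returncode})"
            )
        sleep(delay)
        delay = min(delay * 2, max_delay)
        restarts += 1
```

A real component could be wrapped as
`supervise(lambda: subprocess.call(["some-worker"]))`, where
`some-worker` stands in for a hypothetical command. The `sleep`
parameter exists only so the loop can be exercised without actually
waiting.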

       Vincent

References

On resiliency to failure
From: Evan Dandrea, 2013-12-03
Re: On resiliency to failure
From: Andy Doan, 2013-12-03
Re: On resiliency to failure
From: Vincent Ladeuil, 2013-12-04
Re: On resiliency to failure
From: Andy Doan, 2013-12-04