canonical-ci-engineering team mailing list archive

Re: On resiliency to failure

 

>>>>> Andy Doan <andy.doan@xxxxxxxxxxxxx> writes:

    > I'm purely playing devil's advocate.

Challenge accepted ;)

    > On 12/04/2013 03:07 AM, Vincent Ladeuil wrote:
    >> > As we start to talk more concretely about high availability, I'm starting to
    >> > wonder if we should first ask "is it worth it?"
    >> 
    >> Yes, I think it is worth it, especially at this stage, when neither
    >> tests nor code have been written.

    > Correct, but we have to make sure we don't get so bogged down in
    > trying to make deployment code that we never actually create the
    > features people need.

Yup, so let's focus on simple things first: they are the easiest to
design for reliability.

    >> > ie - what could we expect our availability to be if we just
    >> > deployed a DB and a couple of web-servers. If the answer is >98%,
    >> > then is it worth the man-hours required to get us to 99.9%?
    >> 
    >> And I'd counter that with: how many man-hours will we spend making up
    >> the gap to 100%? As in: every time something fails in the ci engine:
    >> 
    >> - someone is blocked for X hours,
    >> - said someone pings IS or the Vanguard and waits for Y hours,
    >> - IS/Vanguard spends Z hours diagnosing the issue, devising a fix,
    >> testing it, deploying it.
    >> 
    >> Sure, in many cases you can reduce Y and Z, but still, X is a loss.

    > The question is what percent of X is because of failures that HA
    > would prevent?  I don't think we have this quantified. In my mind
    > it's probably less than 50%. From your comments, I'm guessing you
    > feel differently.

Yeah, I want failure modes that require human intervention to
disappear, i.e. driving X to zero.

Or if not zero, at least handling 90% of the cases automatically so we
can focus on the remaining 10%.
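
For a sense of scale, here is a rough sketch of what the 98% and 99.9%
figures quoted above mean as downtime per year. The function name and
the 365-day year are my own simplifications, nothing we have measured:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year during which the service is unavailable."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


for availability in (98.0, 99.9):
    hours = downtime_hours_per_year(availability)
    print(f"{availability}% available: about {hours:.1f} hours down per year")
```

That is roughly 175 hours (a bit over 7 days) at 98% against under 9
hours at 99.9%. It says nothing about how much of that downtime HA would
actually prevent, which is the part we have not quantified.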

We have already faced (and are still facing) similar issues, see:

- Jenkins slaves need to restart automatically
  https://app.asana.com/0/8740321118011/8740321118013

- stopping otto containers left running
  https://app.asana.com/0/search/9060124218653/8792066028175

I.e. we need to design for failure.

It's ok to have component failures as long as we recover and the whole
engine keeps running.
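
To make "recover and keep running" a bit more concrete, here is a
minimal sketch of a supervisor that restarts a failed component with
exponential backoff. Everything in it (the `supervise` name, its
parameters, the default delays) is hypothetical and only illustrates the
idea; it is not code from the ci engine:

```python
import time
from typing import Callable


def supervise(component: Callable[[], None],
              max_restarts: int = 5,
              base_delay: float = 1.0,
              max_delay: float = 60.0,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `component`, restarting it whenever it raises.

    Returns the number of restarts that were needed. Once `max_restarts`
    is exhausted the last exception is re-raised: that is the point
    where a human has to step in.
    """
    restarts = 0
    while True:
        try:
            component()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            # Exponential backoff, capped so that a long outage does
            # not push the delay out indefinitely.
            sleep(min(base_delay * 2 ** restarts, max_delay))
            restarts += 1
```

The `sleep` parameter is injectable only so the loop can be tested
without waiting. The interesting design question is what goes in
`component` and what counts as a failure; the restart loop itself is the
easy part.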

       Vincent

