canonical-ci-engineering team mailing list archive
Message #00460
Re: On resiliency to failure
>>>>> Andy Doan <andy.doan@xxxxxxxxxxxxx> writes:
> I'm purely playing devil's advocate.
Challenge accepted ;)
> On 12/04/2013 03:07 AM, Vincent Ladeuil wrote:
>> > As we start to talk more concretely about high availability, I'm starting to
>> > wonder if we should first ask "is it worth it?"
>>
>> Yes, I think it is worth it, especially at this stage, where neither tests
>> nor code have been written.
> Correct, but we have to make sure we don't get so bogged down in
> trying to make deployment code that we never actually create the
> features people need.
Yup, so let's focus on simple things first: they are the easiest to design
for reliability.
>> > i.e., what could we expect our availability to be if we just
>> > deployed a DB and a couple of web-servers. If the answer is >98%,
>> > then is it worth the man-hours required to get us to 99.9%?
>>
>> And I'd counter that with: how many man-hours will we spend making up the
>> difference to 100%? As in, every time something fails in the CI engine:
>>
>> - someone is blocked for X hours,
>> - said someone pings IS or the Vanguard and waits for Y hours,
>> - IS/Vanguard spends Z hours diagnosing the issue, devising a fix,
>>   testing it, and deploying it.
>>
>> Sure, in many cases you can reduce Y and Z, but still, X is a loss.
> The question is what percent of X is because of failures that HA
> would prevent? I don't think we have this quantified. In my mind
> it's probably less than 50%. From your comments, I'm guessing you
> feel differently.
Yeah, I want failure modes that require human intervention to
disappear, i.e. driving X to zero.
Or if not zero, at least handling 90% of the cases automatically so we
can focus on the remaining 10%.
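To put the 98% and 99.9% figures quoted above into hours, here is a rough
back-of-the-envelope sketch. The function names are made up for illustration,
and it assumes downtime is spread evenly over a non-leap year and that every
failure costs about the same amount of blocked time, which is certainly a
simplification.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Expected hours of downtime over the period for an availability in [0, 1]."""
    return (1.0 - availability) * period_hours


def hours_needing_a_human(total_downtime_hours, automated_fraction):
    """Downtime left for manual handling if a fraction is recovered automatically.

    Assumption: all failures cost roughly the same amount of time.
    """
    return total_downtime_hours * (1.0 - automated_fraction)


if __name__ == "__main__":
    for availability in (0.98, 0.999):
        hours = downtime_hours(availability)
        print(f"{availability:.1%} available: about {hours:.2f} hours down per year")
    # Handling 90% of the 98% case automatically leaves about a tenth for humans.
    print(f"{hours_needing_a_human(downtime_hours(0.98), 0.90):.2f} hours left")
```

Under those assumptions, 98% is roughly 175 hours of downtime a year and 99.9%
is under 9, so the gap being discussed is on the order of a week of blocked
time per year.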
We have already faced (or are still facing) similar issues, see:
- Jenkins slaves need to restart automatically
https://app.asana.com/0/8740321118011/8740321118013
- stopping otto containers left running
https://app.asana.com/0/search/9060124218653/8792066028175
I.e. we need to design for failure.
It's ok to have component failures as long as we recover and the whole
engine keeps running.
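As a minimal sketch of what "recover and keep running" could look like for a
single component, something along these lines restarts a failing task with
exponential backoff and only gives up (and bothers a human) after a few tries.
The name supervise, its parameters, and the backoff policy are assumptions for
illustration, not anything we have in the engine today.

```python
import time


def supervise(run_component, max_restarts=5, base_delay=1.0, sleep=time.sleep):
    """Run run_component() until it succeeds, restarting it on failure.

    Returns the number of restarts that were needed. Re-raises the last
    error once max_restarts is reached, so a human only gets involved
    when automatic recovery has really given up.
    """
    restarts = 0
    while True:
        try:
            run_component()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** restarts)
            restarts += 1


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        # Hypothetical component that fails twice before working.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("component crashed")

    print("restarts needed:", supervise(flaky, base_delay=0.01))
```

The sleep function is passed in so the policy can be tested without actually
waiting; a real supervisor would also want logging and a cap on the delay.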
Vincent