canonical-ci-engineering team mailing list archive
Message #00428
Re: On resiliency to failure
>>>>> Andy Doan <andy.doan@xxxxxxxxxxxxx> writes:
> On 12/03/2013 10:07 AM, Evan Dandrea wrote:
>> Attached is the diagram Nick and Liam drew for how we might lay out
>> each component. Keep in mind this is for a single microservice. We'd
>> want this layout for each one. You can ignore the bit at the top for
>> squid. We won't need that on the front of most things. Instead, a
>> simple Apache in front of HAProxy will suffice.
>> http://ubuntuone.com/0w12vBEgDVn4YUMh5JkoMq
> Thinking about this generically, I'm not sure how specifically these issues
> relate to Django, i.e. all solutions at a minimum are going to require some
> sort of webserver and data-store and/or queue system. The specific solutions
> to make each implementation highly-available will differ, but they'll all
> require something.
> As we start to talk more concretely about high availability, I'm starting to
> wonder if we should first ask "is it worth it?"
Yes, I think it is worth it, especially at this stage, where neither
tests nor code have been written.
> i.e. what could we expect our availability to be if we just
> deployed a DB and a couple of web-servers? If the answer is >98%,
> then is it worth the man-hours required to get us to 99.9%?
And I'd counter that with: how many man-hours will we spend filling the
gap up to 100%? As in: every time something fails in the CI engine:
- someone is blocked for X hours,
- said someone pings IS or the Vanguard and waits for Y hours,
- IS/Vanguard spends Z hours diagnosing the issue, devising a fix,
testing it, deploying it.
Sure, in many cases you can reduce Y and Z, but still, X is a loss.
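To put rough numbers on the 98% versus 99.9% question, here is a minimal
sketch in Python that converts an availability figure into expected hours of
downtime per year. The function name and the 365-day year are assumptions made
for illustration only; nobody in this thread has measured actual availability.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(availability):
    """Return expected downtime in hours per year.

    availability is a fraction between 0 and 1, e.g. 0.98 for 98%.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR
```

Under these assumptions, 98% availability corresponds to roughly 175 hours of
downtime per year and 99.9% to roughly 8.8 hours. The X, Y and Z hours listed
above are spent on top of that, per incident.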
> From another angle: the ppa-assigner component we have will
> probably have fewer than 100 operations a day. So are the odds of it
> being down at the precise time one of those operations is
> executed already <1%?
Then, let's use it as a simple component where we can more easily
experiment with ideas, designs, tests and implementations to get a robust and
reliable architecture in place that we can reuse for other components.
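As a rough check on the odds question, the sketch below estimates the chance
that at least one operation in a day lands on downtime. It assumes each
operation fails independently with probability (1 - availability), which is a
simplification: real outages are contiguous, so failures tend to cluster. The
function name is hypothetical.

```python
def p_any_operation_fails(availability, operations):
    """Probability that at least one of `operations` hits downtime.

    Assumes independent operations, each succeeding with probability
    `availability` (a fraction between 0 and 1).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if operations < 0:
        raise ValueError("operations must not be negative")
    # P(all succeed) = availability ** operations; take the complement.
    return 1.0 - availability ** operations
```

Under that independence assumption, 100 operations a day at 98% availability
give about an 87% chance of at least one failure per day, against about 10% at
99.9%. A small per-operation risk still adds up over a day.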
> I'm really not wanting to sound lazy here. But this feels like it's
> snowballing to a place where the work needed to get a few components
> demo-worthy might be growing too fast.
Right, there is a risk, let's keep it in front of our eyes.
> However, if we don't do this now it might make it too expensive
> later.
You nailed it ;)
Vincent