canonical-ci-engineering team mailing list archive

Re: On resiliency to failure

To: Andy Doan <andy.doan@xxxxxxxxxxxxx>
From: Francis Ginther <francis.ginther@xxxxxxxxxxxxx>
Date: Tue, 3 Dec 2013 15:58:58 -0600
Cc: liam.young@xxxxxxxxxxxxx, Lex Moffitt <nick.moffitt@xxxxxxxxxxxxx>, tristram.oaten@xxxxxxxxxxxxx, canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <529E264A.2010200@canonical.com>

On Tue, Dec 3, 2013 at 12:43 PM, Andy Doan <andy.doan@xxxxxxxxxxxxx> wrote:
> As we start to talk more concretely about high availability, I'm starting to
> wonder if we should first ask "is it worth it?" ie - what could we expect
> our availability to be if we just deployed a DB and a couple of web-servers.
> If the answer is >98%, then is it worth the man-hours required to get us to
> 99.9%?

Agreed. What components truly need to be under HA redundancy and which
do we just restart when they fail? I'll argue that only the Projects
Manager needs to be redundant. For the others I was hoping to back up
any persistent storage to swift or even rely directly on swift for
persistent data. Are there any patterns for doing this? I'm of the
opinion that our time is better spent making our APIs capable of
gracefully dealing with failure rather than making sure components
don't fail.
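
To put the quoted 98% and 99.9% figures in concrete terms, here is a
rough sketch of the downtime each one allows per year. The function
name is made up for illustration, and it assumes a 365-day year with
downtime spread over the whole calendar rather than business hours.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(availability):
    """Return the hours of downtime per year allowed by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (0.98, 0.999):
        hours = downtime_hours_per_year(target)
        print(f"{target:.1%} available: {hours:.2f} hours/year ({hours / 24:.2f} days)")
```

By this arithmetic 98% allows roughly 175 hours (a bit over a week) of
downtime a year, while 99.9% allows under 9 hours.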

So, can we have a common strategy for when the PPA Assigner (for
example) fails? What is monitoring the components for failure and
standing them back up? How do consumers handle timeouts? What status
is communicated? ...?
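
As one possible starting point for the consumer side of these
questions, here is a minimal sketch of a caller that applies a timeout,
retries with exponential backoff, and reports a clear failure once it
gives up. Everything in it (call_with_retry, ComponentUnavailable, the
default values) is hypothetical and not taken from any existing
component; the request argument stands in for whatever client call a
consumer would make.

```python
import time


class ComponentUnavailable(Exception):
    """Raised when a component stays unreachable after all retries."""


def call_with_retry(request, attempts=3, timeout=5.0, backoff=0.5, sleep=time.sleep):
    """Call request(timeout), retrying on timeouts and connection errors.

    request is any callable that takes a timeout in seconds and either
    returns a result or raises TimeoutError / ConnectionError.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return request(timeout)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Wait backoff, 2 * backoff, 4 * backoff, ... between attempts,
            # but do not sleep after the final failure.
            if attempt < attempts - 1:
                sleep(backoff * 2 ** attempt)
    raise ComponentUnavailable(
        f"gave up after {attempts} attempts"
    ) from last_error
```

A consumer would wrap its call to, say, the PPA Assigner in
call_with_retry and treat ComponentUnavailable as the status to report
upward, which leaves room for a separate monitor to restart the failed
component in the meantime.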

Francis
-- 
Francis Ginther
Canonical - Ubuntu Engineering - Continuous Integration Team

References

On resiliency to failure
From: Evan Dandrea, 2013-12-03
Re: On resiliency to failure
From: Andy Doan, 2013-12-03