
canonical-ci-engineering team mailing list archive

Re: proposal for next sprint

 

On 7 May 2015 at 16:19, Thomi Richards <thomi.richards@xxxxxxxxxxxxx> wrote:
>>> On Thu, May 7, 2015 at 12:14 PM, Celso Providelo
>>> <celso.providelo@xxxxxxxxxxxxx> wrote:
>>>> Let me briefly give you my take on this wish of monitoring *everything*
>>>> hoping they will someday matter, like tenant quota usage, individual unit
>>>> raw-data (disk, cpu, mem, etc). This comes from the Newtonian Root-Cause
>>>> Analysis mindset we were taught our entire life [3], but this simplification
>>>> becomes suboptimal for complex systems, where solving the 'root-cause' often
>>>> uncovers new problems with new effects not monitored before, i.e. monitoring
>>>> all the possible cause-effect combinations becomes expensive because of
>>>> their unpredictable relationships, we will never monitor enough to prevent
>>>> problems.

...

> I get that root-cause-analysis for complex systems is hard, but
> "cause-analysis" can be easy. In my example, the symptom was "OMG we're
> running out of keypairs". The cause is "we're leaking keypairs". Who cares
> about the root cause? There is a problem (running out of keypairs), and if
> we don't fix it our system will break, and will break _hard_. Let's track
> these things which we know, if they ever happen, will break our services.
>
> We'll never be smart enough to predict all future sources of failure, but we
> know from experience that there's a class of failure we can monitor and
> alert on.

The problem is not that root cause analysis is hard in complex
systems. The problem is that there is no root cause of failure in
complex systems. Newton's Third Law only works when you can isolate a
system down to single inputs, and that is precisely what we cannot do
when watching a process influenced by its environment (think of the
Butterfly Effect). Such a process is non-deterministic.

We are only going to create false confidence and waste time if we take
the position that the more holes we plug, the safer we will be.

There are myriad ways that work can end up dead-lettered. Work in
such a holding pattern can be investigated by a human (at their
leisure) and thrown back onto the queue when the problem is fixed, all
without the end user ever noticing. Why bother to isolate and report
on each individual variant of these when we will always have to go to
the queue and inspect the messages?
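
To make the dead-letter idea concrete, here is a minimal sketch using
in-memory queues. Everything in it is hypothetical (the names
process_all, requeue and the handler are invented for illustration and
are not taken from our services or from any particular message broker).
It only shows that every failure variant lands in one place, where a
human can inspect it and later throw it back onto the work queue.

```python
from collections import deque


def process_all(work, dead_letters, handler):
    """Run handler on each queued message; park failures instead of dropping them."""
    done = []
    while work:
        msg = work.popleft()
        try:
            done.append(handler(msg))
        except Exception as exc:
            # Whatever the failure variant, the message ends up in the same
            # holding queue together with the error that put it there.
            dead_letters.append({"message": msg, "error": repr(exc)})
    return done


def requeue(dead_letters, work):
    """Move every parked message back onto the work queue; return how many."""
    count = len(dead_letters)
    while dead_letters:
        work.append(dead_letters.popleft()["message"])
    return count
```

A human would look at the "error" field of each parked entry, fix the
underlying problem at their leisure, and then call requeue() so the
messages are processed again.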

