canonical-ci-engineering team mailing list archive

Re: proposal for next sprint

To: Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx>
From: Thomi Richards <thomi.richards@xxxxxxxxxxxxx>
Date: Sat, 9 May 2015 10:34:53 +1200
Cc: canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAOe9oG5pnjr0Jax53PQxq3mhORHBVrn8t+HvReQk9TBCPUzHnA@mail.gmail.com>

On Sat, May 9, 2015 at 9:52 AM, Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx>
wrote:

> The problem is not that root cause analysis is hard in complex
> systems. The problem is that there is no root cause of failure in
> complex systems. Newton's Third Law only works when you can isolate
> down to singular inputs, but that is necessarily not what we're
> dealing with when watching a process influenced by its environment
> (think of the Butterfly Effect). It is non-deterministic.
>
>
Who cares about root cause analysis? I just want to stop things breaking.
 :D

I think my anecdote demonstrates a clear example of where and how this
would help us. We had a failure; its cause was easy and obvious to isolate
once it was brought to my attention, and we would have caught that failure
before it resulted in the system failing if we had had monitoring & alerting
in place.

I agree that there's deeper complexity here: What caused the key-pairs to
leak? However, our priority is keeping the system from catastrophic
failure. I simply don't care what caused the key-pairs to leak when all the
jobs are failing: I just want to be alerted, and to fix the immediate
problem to hand.

This scenario *will* happen again. Without monitoring, our workers will
start failing 100% of jobs passed to them, which will halt
proposed-migration from running. I'd like to make that less likely to
happen.
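
A minimal sketch of the kind of check this implies, assuming each worker
can report whether a job failed. The class name, the window size and the
threshold are all hypothetical, not part of any existing tooling:

```python
from collections import deque


class FailureRateMonitor:
    """Track the most recent job results and flag a sustained failure rate.

    Hypothetical helper: the window size and threshold are assumptions.
    """

    def __init__(self, window=20, threshold=0.9):
        self.results = deque(maxlen=window)  # True means the job failed
        self.threshold = threshold

    def record(self, failed):
        """Record the outcome of one job."""
        self.results.append(bool(failed))

    def failure_rate(self):
        """Fraction of failed jobs among the results currently held."""
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def should_alert(self):
        # Only alert once the window is full, so a single early failure
        # does not page anyone.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.failure_rate() >= self.threshold
```

A worker would call record() after every job and raise an alert whenever
should_alert() returns True. How the alert is delivered is left open here.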

We should be building systems that we can diagnose and fix *before* our
stakeholders call us and say "proposed migration has been broken for 2 days
now".

> We are only going to create false confidence and waste time if we take
> the position that the more holes we plug, the safer we will be.
>
>
To me, this boils down to "we'll never catch everything, so why bother
trying to catch anything".

Monitoring breeds false confidence in exactly the same way as logging does.
If you assume that the monitoring & logging systems are giving you a 100%
accurate picture of the systems running then you're going to get hurt... so
don't do that. Instead, treat both these systems as hints to point you in
the right direction. They aid early detection of problems; they're not a
panacea.

Also, yes, I think that the more holes we plug, the safer we will be. There
are many weird and wonderful ways systems can fail, but we can, over time,
make the system more robust. Can we ever predict and account for
everything? Of course not. We can make things better though. Call me an
optimist if you want, but I don't think there's literally an infinite
number of ways the system can fail. I think there's a finite upper bound.
Therefore, the more situations we can monitor and alert on, the more robust
the system will be.

> There's a myriad of ways that work can end up dead-lettered. Work in
> such a holding pattern can be investigated by a human (at their
> leisure) and thrown back onto the queue when the problem is fixed, all
> without the end user ever noticing.
>

What happens when we run out of keypairs late on a Friday night? What
happens if we're all busy on some new sprint and don't think to monitor the
dead letter queue?

> Why bother to isolate and report
> on each individual variant of these when we will always have to go to
> the queue and inspect the messages?
>

But we don't inspect the dead letter queue at all! Yesterday we had 361
messages in the adt deadletters queue. Today there are 5. I suspect we dumped
the queue because it was getting too big. I agree that if we had some easy
way of inspecting the queue and investigating issues we might catch these
problems sooner, but we don't, and haven't even planned for one.

I'd like to see some alternative suggestions for how to solve the problem
of being alerted when things start breaking. Manually monitoring logs or a
rabbit queue doesn't count - we surely don't want to start adding manual
log parsing to our list of day-to-day activities?

Perhaps there's a solution involving something that backs on to the dead
letter queue, but I'd like to see a concrete proposal before discarding
the approach outlined in my initial email.
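
For illustration only, something backing on to the dead letter queue could
be as small as a periodic depth check. The sketch below assumes the queue
depth has already been read from the broker (for example via a passive
queue declare). The function name and the limit of 50 are hypothetical:

```python
def check_dead_letters(depth, previous_depth=None, max_depth=50):
    """Return a list of alert strings; an empty list means all clear.

    Hypothetical helper: max_depth is an assumed limit, not a known value.
    """
    alerts = []
    if depth > max_depth:
        alerts.append(
            f"dead-letter queue holds {depth} messages (limit {max_depth})"
        )
    # A shrinking queue means somebody requeued or purged messages;
    # report it so that a silent purge does not go unnoticed.
    if previous_depth is not None and depth < previous_depth:
        alerts.append(
            f"dead-letter queue shrank from {previous_depth} to {depth}"
        )
    return alerts
```

Run from cron or a similar scheduler, this would have flagged both the 361
messages and the later drop to 5, without anyone inspecting the queue by
hand.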

Cheers,
-- 
Thomi Richards
thomi.richards@xxxxxxxxxxxxx

Follow ups

Re: proposal for next sprint
From: Evan Dandrea, 2015-05-10

References

proposal for next sprint
From: Thomi Richards, 2015-05-04
Re: proposal for next sprint
From: Francis Ginther, 2015-05-04
Re: proposal for next sprint
From: Thomi Richards, 2015-05-04
Re: proposal for next sprint
From: Paul Larson, 2015-05-06
Re: proposal for next sprint
From: Thomi Richards, 2015-05-07
Re: proposal for next sprint
From: Francis Ginther, 2015-05-07
Re: proposal for next sprint
From: Thomi Richards, 2015-05-07
Re: proposal for next sprint
From: psivaa, 2015-05-07
Re: proposal for next sprint
From: Celso Providelo, 2015-05-07
Re: proposal for next sprint
From: Ursula Junque, 2015-05-07
Re: proposal for next sprint
From: Francis Ginther, 2015-05-07
Re: proposal for next sprint
From: Thomi Richards, 2015-05-07
Re: proposal for next sprint
From: Evan Dandrea, 2015-05-08