
Re: proposal for next sprint

 

On 8 May 2015 at 17:34, Thomi Richards <thomi.richards@xxxxxxxxxxxxx> wrote:
> I agree that there's deeper complexity here: What caused the key-pairs to
> leak? However, our priority is keeping the system from catastrophic failure.
> I simply don't care what caused the key-pairs to leak when all the jobs are
> failing: I just want to be alerted, and to fix the immediate problem to
> hand.

There are many cases like this. Today it is keypairs. Tomorrow it is
security groups. Thursday keystone will go down. Should we hunt down
each of these individually? Would we not cover more ground and save
ourselves considerable work if we just let things fall into the dead
letter queue?

Pop the top message. See that we're out of keypairs. Fix that. Blindly
throw everything from the dead letter queue back to the front of the
system. Wait to see if anything else dead letters.

If we've actually fixed the leak, we haven't had to add any
monitoring code (complexity), and we won't be troubled with queue
processing more than once.
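
To be concrete, the "throw everything back" step is barely any tooling
at all. Something like this would do it -- a sketch only, assuming
we're on RabbitMQ via pika, and the queue names here are made up:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()

    while True:
        # basic_get returns (None, None, None) once the queue is drained
        method, properties, body = channel.basic_get(queue='adt.deadletters')
        if method is None:
            break
        # Throw the message back at the front of the system...
        channel.basic_publish(exchange='', routing_key='adt.requests',
                              body=body, properties=properties)
        # ...and only then ack it off the dead letter queue.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection.close()

Run it once after fixing the leak, then wait and see whether anything
dead letters again.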

> This scenario will happen again. Without monitoring, our workers will start
> failing 100% of jobs passed to them, which will halt proposed-migration from
> running. I'd like to make that less likely to happen.

Why aren't these just going to the dead letter queue where they can be
manually reviewed and thrown back into the mix?
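
For what it's worth, the plumbing for that is small. Rough sketch,
again assuming RabbitMQ and pika; the exchange and queue names and
run_job() are placeholders, not anything we have today:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()

    # A dead letter exchange, plus a queue bound to it to catch the casualties.
    channel.exchange_declare(exchange='adt.dlx', exchange_type='fanout')
    channel.queue_declare(queue='adt.deadletters')
    channel.queue_bind(queue='adt.deadletters', exchange='adt.dlx')

    # The work queue routes anything a worker rejects to that exchange.
    channel.queue_declare(queue='adt.requests',
                          arguments={'x-dead-letter-exchange': 'adt.dlx'})

    def on_request(channel, method, properties, body):
        try:
            run_job(body)   # hypothetical worker entry point
            channel.basic_ack(delivery_tag=method.delivery_tag)
        except Exception:
            # requeue=False sends the message to the dead letter exchange
            # instead of bouncing it straight back at us.
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

    channel.basic_consume(queue='adt.requests', on_message_callback=on_request)
    channel.start_consuming()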

> We should be building systems that we can diagnose and fix *before* our
> stakeholders call us and say "proposed migration has been broken for 2 days
> now".

Agreed. If I may tweak slightly, we should be building systems that
let us diagnose and fix /at our own pace/, without our stakeholders
ever seeing an error message. We should only give them results when
there are results to act on, and infrastructure failure is not
something they can act on.

If Jenkins timeouts are fucking us here, let's fix Jenkins. Or, you
know, burn it.

>> We are only going to create false confidence and waste time if we take
>> the position that the more holes we plug, the safer we will be.
>
> To me, this boils down to "we'll never catch everything, so why bother
> trying to catch anything".

Not at all. I am trying to suggest that instead of leaping to solve
every narrowly defined bug, we first see whether we can catch a larger
group of them with one simple bucket, at minimal effort and complexity.

Let's be clever about how we architect for failure. You guys are too
smart to focus on the immediate problem instead of the bigger picture,
or to let the system constantly put you on the back foot in a game of
whack-a-mole.

> Also, yes I think that the more holes we plug, the safer we will be. There
> are many weird and wonderful ways systems can fail, but we can, over time,
> make the system more robust. Can we ever predict and account for everything?
> Of course not. We can make things better though. Call me an optimist if you
> want, but I don't think there's literally an infinite number of ways the
> system can fail. I think there's a finite upper bound. Therefore, the more
> situations we can monitor and alert on, the more robust the system will be.

You mistake my realism for pessimism. I am not saying we should throw
up our hands and declare quality a lost cause. I am suggesting we
accept that there are more instances of failure that we can anticipate
than we have time to individually address, and that there is a whole
class of unknown failures for which we can do nothing specific.

If we fix individual problems and ignore the unknowns, our optimism
creates exactly that false sense of security, and an illusory path
towards greater stability.

Let us focus our efforts on solutions that acknowledge the unknown
classes of failure, and that recognise this is not some Newtonian
system that can be neatly reduced to a finite set of states.

> What happens when we run out of keypairs late on a Friday night? What
> happens if we're all busy on some new sprint and don't think to monitor the
> dead letter queue?

See below.

> But we don't inspect the dead letter queue at all!  Yesterday we had 361
> messages in the adt deadletters queue. Today there's 5. I suspect we dumped
> the queue because it was getting too big. I agree that if we had some easy
> way of inspecting the queue and investigating issues we might catch these
> problems sooner, but we don't, and haven't even planned for one.

I would argue that's the biggest problem here. We should send the dead
letter queue size to logstash and hook PagerDuty up to alert on it.
Any objections to getting that in as extra work for the next sprint?
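
The reporting side is tiny. Something along these lines -- a sketch
only, assuming RabbitMQ via pika and a logstash TCP input with a
JSON-lines codec; the host, port and queue name are placeholders:

    import json
    import socket

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()

    # passive=True just inspects the queue without creating or changing it
    declared = channel.queue_declare(queue='adt.deadletters', passive=True)
    depth = declared.method.message_count
    connection.close()

    # One JSON event per line; PagerDuty alerting then hangs off logstash.
    event = {'type': 'deadletter-depth', 'queue': 'adt.deadletters', 'depth': depth}
    with socket.create_connection(('logstash.internal', 5000)) as sock:
        sock.sendall(json.dumps(event).encode('utf-8') + b'\n')

Cron it every few minutes and page when the depth stays nonzero for
longer than we're comfortable with.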

