
canonical-ci-engineering team mailing list archive

Re: Ephemeral workers for the win


On Fri, Apr 11, 2014 at 8:02 AM, Evan Dandrea
<evan.dandrea@xxxxxxxxxxxxx> wrote:
> 1. Evan juju upgrades the deployed test runner component with some broken code.
> 2. A ticket comes through to the test runner worker and it crashes.
> Because the worker didn't ack this message, the ticket goes back on
> the queue.
> 3. Round and round it spins. It comes up to a worker again, fails, and
> goes back in the queue.
>
> Now, we could leave it in this state forever and let the user come to
> us to say that the ticket appears wedged, but...
>
> With each new attempt, the test runner worker reports an OOPS for
> failure to process that message in the queue. We can then deal with
> this *asynchronously.* Here is the cool part:
>
> We juju upgrade the deployed test runner component again and the
> ticket escapes the loop.
>
> The test runner finishes and passes the ticket onto the next step. We
> didn't have to retry or resubmit an entire ticket. The work just sat
> there waiting for the environment to get better so it could continue.
>
> It wasn't a stop the line event. We could deal with it without
> worrying that a component was down and UE was losing development time
> because they couldn't submit new tickets.
>
> Questions:
>
> - Does this sound sensible? How do we know when to tell Nagios that
> the Vanguard needs to be contacted? On the first OOPS, or some other
> condition?
Until we know otherwise, I think all OOPS events should be reported
right away; that's the safest option until we have more data.  If we
later find that some of them aren't serious, we can make the effort to
filter them out, or perhaps adjust the notification timeout for things
that are capable of self-recovering.
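The requeue loop Evan describes can be sketched as a tiny simulation
(hypothetical names only, not our actual runner code): an unacked
ticket goes back on the queue, each failed attempt emits an OOPS, and
once the handler is fixed the ticket simply escapes the loop.

```python
from collections import deque

def run_queue(queue, handler, max_attempts=5):
    """Simulate Rabbit-style redelivery: a ticket whose handler raises
    is not acked, so it goes back on the queue; each failure emits an
    OOPS we can deal with asynchronously."""
    oopses = []
    attempts = 0
    while queue and attempts < max_attempts:
        ticket = queue.popleft()
        attempts += 1
        try:
            handler(ticket)          # ack happens implicitly on success
        except Exception as exc:
            oopses.append(str(exc))  # report an OOPS, deal with it later
            queue.append(ticket)     # unacked: back on the queue
    return oopses

queue = deque(["ticket-1"])

def broken(ticket):                  # the broken upgraded test runner
    raise RuntimeError("cannot process %s" % ticket)

oopses = run_queue(queue, broken, max_attempts=3)
print(len(oopses), len(queue))       # → 3 1: three OOPSes, ticket still queued

def fixed(ticket):                   # after the second juju upgrade
    pass

oopses = run_queue(queue, fixed)
print(len(oopses), len(queue))       # → 0 0: the ticket escapes the loop
```

Note the ticket itself never has to be resubmitted; only the handler
changes between the two runs.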

> - This only saves us when we get as far as the Rabbit event loop.
> We'll have to invent some sort of watchdog for the case when the
> process dies prior to that point. What should that look like?
One thing it would probably need to watch for is whether anything is
actually handling the message (if we don't ack before processing).  On
the other end, if we ack first, it would need to make sure the task
gets back in the queue if something dies while trying to take care of
it, right?
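For the ack-first case, one shape the watchdog could take is a
visibility-timeout sweep: track each in-flight ticket with a deadline,
and requeue any whose worker never reported completion.  This is just
a sketch of the idea; all names here are hypothetical, not an existing
component.

```python
from collections import deque

class Watchdog:
    """Sketch of the ack-first case: track in-flight tickets and
    requeue any whose worker has not completed within the timeout."""

    def __init__(self, queue, timeout=30):
        self.queue = queue
        self.timeout = timeout
        self.in_flight = {}          # ticket -> deadline

    def checkout(self, now):
        """Worker takes a ticket (and acks it immediately)."""
        ticket = self.queue.popleft()
        self.in_flight[ticket] = now + self.timeout
        return ticket

    def complete(self, ticket):
        """Worker finished; stop watching this ticket."""
        del self.in_flight[ticket]

    def sweep(self, now):
        """Requeue tickets whose worker died without completing."""
        for ticket, deadline in list(self.in_flight.items()):
            if now >= deadline:
                del self.in_flight[ticket]
                self.queue.append(ticket)

queue = deque(["ticket-7"])
dog = Watchdog(queue, timeout=30)
ticket = dog.checkout(now=0)  # worker takes the ticket, acks immediately
# ...worker crashes before calling dog.complete(ticket)...
dog.sweep(now=31)             # past the deadline: ticket goes back
print(list(queue))            # → ['ticket-7']
```

The ack-after-processing case is the mirror image: Rabbit itself does
the requeueing when the connection drops, so the watchdog there would
only need to spot tickets that keep cycling without ever being handled.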

