canonical-ci-engineering team mailing list archive

Ephemeral workers for the win

 

1. Evan juju upgrades the deployed test runner component with some broken code.
2. A ticket comes through to the test runner worker, and the worker
crashes. Because the worker didn't ack this message, the ticket goes
back on the queue.
3. Round and round it spins. It comes up to a worker again, fails, and
goes back in the queue.

Now, we could leave it in this state forever and let the user come to
us to say that the ticket appears wedged, but...

With each new attempt, the test runner worker reports an OOPS for
failing to process that message in the queue. We can then deal with
this *asynchronously.* Here is the cool part:

We juju upgrade the deployed test runner component again and the
ticket escapes the loop.

The test runner finishes and passes the ticket onto the next step. We
didn't have to retry or resubmit an entire ticket. The work just sat
there waiting for the environment to get better so it could continue.
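
The loop and the escape described above can be sketched with a toy
in-memory queue. This is only an illustration, not our deployed code:
the Broker class, the two runner functions and the ticket name are all
hypothetical, and a plain deque stands in for Rabbit's ack/requeue
behaviour.

```python
from collections import deque


class Broker:
    """Toy stand-in (hypothetical) for a queue with manual acks."""

    def __init__(self):
        self.queue = deque()
        self.oopses = []  # one entry per failed processing attempt

    def publish(self, ticket):
        self.queue.append(ticket)

    def deliver_one(self, worker):
        """Hand the next ticket to a worker; requeue it if the worker crashes."""
        ticket = self.queue.popleft()
        try:
            worker(ticket)
        except Exception as exc:
            # No ack was sent, so record an OOPS and put the ticket back.
            self.oopses.append((ticket, repr(exc)))
            self.queue.append(ticket)
            return False
        # The worker finished, which counts as the ack; the ticket is gone.
        return True


def broken_runner(ticket):
    # Stands in for the test runner after the bad upgrade.
    raise RuntimeError("broken deploy")


def fixed_runner(ticket):
    # Stands in for the test runner after the second upgrade.
    return f"{ticket} passed"


broker = Broker()
broker.publish("ticket-1")

# Steps 2 and 3: the ticket keeps coming back while the code is broken.
attempts = [broker.deliver_one(broken_runner) for _ in range(3)]
queued_before_fix = list(broker.queue)

# After the fixed upgrade, the same ticket is processed and leaves the queue.
acked = broker.deliver_one(fixed_runner)
```

The point of the sketch is that the ticket itself is never resubmitted:
it stays queued across the failed attempts, and each failure leaves an
OOPS behind for us to look at.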

It wasn't a stop-the-line event. We could deal with it without
worrying that a component was down and UE was losing development time
because they couldn't submit new tickets.

Questions:

- Does this sound sensible? How do we know when to tell Nagios that
the Vanguard needs to be contacted? On the first OOPS, or some other
condition?
- This only saves us when we get as far as the Rabbit event loop.
We'll have to invent some sort of watchdog for the case when the
process dies prior to that point. What should that look like?
- What's unaccounted for?
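
As a starting point for the first question, one option would be to
count OOPSes per ticket and only page once a threshold is reached. The
sketch below assumes that approach; OopsMonitor, its threshold of 3 and
the ticket ids are made up for illustration, and wiring the result into
Nagios is left out entirely.

```python
from collections import Counter


class OopsMonitor:
    """Hypothetical per-ticket OOPS counter with an alert threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, ticket_id):
        """Record one OOPS; return True only when the threshold is first hit."""
        self.counts[ticket_id] += 1
        return self.counts[ticket_id] == self.threshold


monitor = OopsMonitor(threshold=3)

# The same ticket fails four times; only the third failure should alert.
alerts = [monitor.record("ticket-1") for _ in range(4)]

# A different ticket failing once does not alert.
other = monitor.record("ticket-2")
```

Alerting on the first OOPS would be the same thing with threshold=1;
the trade-off is noise versus how long a ticket spins before someone
looks at it.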

