
canonical-ci-engineering team mailing list archive

Re: Ephemeral workers for the win

 

On 04/11/2014 08:02 AM, Evan Dandrea wrote:
Questions:

- Does this sound sensible? How do we know when to tell Nagios that
the Vanguard needs to be contacted? On the first OOPS, or some other
condition?
- This only saves us when we get as far as the Rabbit event loop.
We'll have to invent some sort of watchdog for the case when the
process dies prior to that point. What should that look like?

This part probably works too. Then the lander asks the ticket-system for the next ticket (which is determined by the ticket-state). I don't think we update the ticket-state until we've delivered our first message to Rabbit. So if things fail during this window, the lander will get this same ticket the next time it asks for the next one.
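To make that window concrete, here is a rough sketch of the ordering I have in mind. It assumes the ticket-state really is only updated after the first publish succeeds, which I have not verified. TicketSystem, next_ticket, mark_started and land are made-up names for illustration, not the real lander code.

```python
class TicketSystem:
    """Hypothetical in-memory stand-in for the ticket-system."""

    def __init__(self, ticket_ids):
        # Dicts keep insertion order, so the oldest queued ticket comes first.
        self.state = {ticket_id: "queued" for ticket_id in ticket_ids}

    def next_ticket(self):
        # The next ticket is determined purely by the ticket-state.
        for ticket_id, state in self.state.items():
            if state == "queued":
                return ticket_id
        return None

    def mark_started(self, ticket_id):
        self.state[ticket_id] = "in_progress"


def land(tickets, publish):
    """Take the next ticket; only change its state after publish succeeds."""
    ticket_id = tickets.next_ticket()
    if ticket_id is None:
        return None
    publish(ticket_id)                # may raise if Rabbit is unreachable
    tickets.mark_started(ticket_id)   # not reached when publish fails
    return ticket_id
```

If publish raises, the ticket is still "queued", so the next call to land picks up the same ticket again.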

- What's unaccounted for?

The workers currently treat a KeyboardInterrupt (which is what an upstart restart of the job will send) as meaning someone canceled the ticket, and they ack the message before exiting. We could remove the message acking, but then you wind up in a position where it's hard to recover from a malformed or unexpected message. It would be nice if we were smart enough to somehow exit either way.
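One possible shape for this, as a sketch only: ack messages we cannot parse so they do not come back forever, but leave the message unacked on KeyboardInterrupt so the broker redelivers it after the restart. FakeChannel, MalformedMessage, parse and on_message are hypothetical names. The fake channel only mimics the basic_ack(delivery_tag) call of a real channel, and the message format (a dict with a ticket_id key) is an assumption.

```python
class MalformedMessage(Exception):
    """Raised when a message body cannot be understood (hypothetical)."""


class FakeChannel:
    """Stand-in for a Rabbit channel; records acks instead of sending them."""

    def __init__(self):
        self.acked = []

    def basic_ack(self, delivery_tag):
        self.acked.append(delivery_tag)


def parse(body):
    # Assumed format: a dict carrying at least a ticket_id.
    if not isinstance(body, dict) or "ticket_id" not in body:
        raise MalformedMessage(repr(body))
    return body


def on_message(channel, delivery_tag, body, process):
    try:
        ticket = parse(body)
    except MalformedMessage:
        # Poison message: ack it so it is not redelivered forever.
        channel.basic_ack(delivery_tag)
        return "dropped"
    try:
        process(ticket)
    except KeyboardInterrupt:
        # Restart requested: do not ack, so the message is redelivered.
        return "interrupted"
    channel.basic_ack(delivery_tag)
    return "done"
```

The caller would exit the worker on "interrupted". This does not distinguish a restart from someone really canceling the ticket; that would need a separate signal or flag.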

