canonical-ci-engineering team mailing list archive
Message #01093
Re: proposal for next sprint
On 07/05/15 05:53, Thomi Richards wrote:
A small anecdote from my afternoon that taught me a few lessons:
Today workers started failing all tests. First it was one worker, then two
workers, then a bit later it was three workers. I happened to be looking at
the jenkins jobs, and caught the error message as it went past. However,
that was pure luck. I could easily imagine a scenario where no one was
looking at the jenkins jobs, and it might take 24 hours to realise that
proposed-migration was horribly backed up.
First thought: We might have seen it if we were tracking 'pass rate per
worker' - we'd have seen one worker spike to 100% fail rate. We could alert
on that, even.
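Something like the sketch below (purely illustrative; the shape of the results
data is invented, not what jenkins actually hands us) would be enough to spot a
worker whose recent fail rate has spiked:

    # Illustrative only: compute per-worker fail rates over a recent window of
    # results and flag any worker above a threshold. The (worker, passed)
    # tuples are a stand-in for whatever we actually pull out of jenkins.
    from collections import defaultdict

    def workers_to_alert(results, fail_threshold=0.9):
        """results: iterable of (worker_name, passed_bool) tuples."""
        totals = defaultdict(int)
        failures = defaultdict(int)
        for worker, passed in results:
            totals[worker] += 1
            if not passed:
                failures[worker] += 1
        return [w for w in totals
                if failures[w] / float(totals[w]) >= fail_threshold]

    # workers_to_alert([('worker-1', False), ('worker-1', False),
    #                   ('worker-2', True)]) == ['worker-1']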
After diagnosing it, it turns out the problem was that we'd slowly been
leaking nova keypairs, and had hit our quota limit. This was easy to fix (I
deleted the unused keypairs), but it got me thinking...
Second Thought: We should be monitoring everything where there's a hard
quota in place. We could easily track 'num keypairs left before the world
ends', and alert if that got below `N`.
To my mind, we monitor disk space, and keypairs are a similar category:
* We have a known hard quota.
* It's a pretty catastrophic failure when we run out.
* Measuring how many you have left is reasonably trivial.
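As a rough illustration of how trivial, assuming python-novaclient and the
usual OS_* environment variables (the quota and threshold numbers are
placeholders, not our real limits):

    # Minimal sketch: count nova keypairs and shout when we get close to the
    # hard quota. In practice the quota could also be read from the limits
    # API rather than hard-coded.
    import os
    from novaclient import client as nova_client

    QUOTA = 100       # hypothetical hard quota for keypairs in this tenant
    THRESHOLD = 10    # alert when fewer than this many keypairs remain

    nova = nova_client.Client(
        '2',
        os.environ['OS_USERNAME'],
        os.environ['OS_PASSWORD'],
        os.environ['OS_TENANT_NAME'],
        os.environ['OS_AUTH_URL'],
    )

    used = len(nova.keypairs.list())
    remaining = QUOTA - used
    print("keypairs: %d used, %d remaining" % (used, remaining))
    if remaining < THRESHOLD:
        # Here we'd raise an alert rather than just print.
        print("ALERT: only %d keypairs left before the world ends" % remaining)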
Although our first priority is to stop any leaks (which IIRC we already did for secgroups, for other reasons), I like this idea of monitoring and alerting when the stock runs low.
It may seem that for certain solutions, e.g. bbb, we may not see similar types of issues, but it's always best to monitor them, like a stock control system.
Also, we could use this collated data to back up any quota increase requests, as mentioned elsewhere in this thread.
If I may be permitted to jump into full engineering-implementation mode for
a second....
Imagine a generic service that simply exposes statistics for the OpenStack
tenant it's deployed in? We could then deploy one in every tenant we use,
and have the stats monitoring system read those....
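As one possible shape for that (Flask plus python-novaclient here are just the
first tools that come to mind, and the metric names are made up), the service
could simply count the quota-bound resources in its tenant and serve them as
JSON for the stats system to scrape:

    # Rough sketch of a per-tenant stats endpoint. Whether a particular
    # manager (e.g. security_groups) is available depends on the novaclient
    # version; the set of resources counted here is illustrative.
    import json
    import os

    from flask import Flask
    from novaclient import client as nova_client

    app = Flask(__name__)


    def get_nova():
        return nova_client.Client(
            '2',
            os.environ['OS_USERNAME'],
            os.environ['OS_PASSWORD'],
            os.environ['OS_TENANT_NAME'],
            os.environ['OS_AUTH_URL'],
        )


    @app.route('/stats')
    def stats():
        nova = get_nova()
        # Counts of the resources we know have hard quotas in our tenants.
        return json.dumps({
            'keypairs': len(nova.keypairs.list()),
            'security_groups': len(nova.security_groups.list()),
            'servers': len(nova.servers.list()),
        })


    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8080)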
Anyway, I thought that was an interesting experience :D