canonical-ci-engineering team mailing list archive
Message #01092
Re: proposal for next sprint
A small anecdote from my afternoon that taught me a few lessons:
Today workers started failing all tests. First it was one worker, then two
workers, then a bit later it was three workers. I happened to be looking at
the jenkins jobs, and caught the error message as it went past. However,
that was pure luck. I could easily imagine a scenario where no one was
looking at the jenkins jobs, and it might take 24 hours to realise that
proposed-migration was horribly backed up.
First thought: We might have seen it if we were tracking 'pass rate per
worker' - we'd have seen one worker spike to 100% fail rate. We could alert
on that, even.
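As a rough illustration of the 'pass rate per worker' idea, here is a minimal sketch. It assumes test results are already available as (worker, passed) pairs; how they would be pulled out of jenkins is not covered, and the function names, the 90% threshold and the minimum run count are all hypothetical.

```python
from collections import Counter


def fail_rates(results):
    """Return {worker: (fail_rate, run_count)} from (worker, passed) pairs."""
    runs = Counter()
    fails = Counter()
    for worker, passed in results:
        runs[worker] += 1
        if not passed:
            fails[worker] += 1
    return {w: (fails[w] / runs[w], runs[w]) for w in runs}


def workers_to_alert(results, threshold=0.9, min_runs=5):
    """List workers whose fail rate is at or above the threshold.

    min_runs (an assumed knob) avoids alerting on a worker that has
    only run a couple of tests so far.
    """
    return sorted(
        worker
        for worker, (rate, count) in fail_rates(results).items()
        if count >= min_runs and rate >= threshold
    )
```

Fed with recent results only (say, the last hour), a worker that starts failing everything would show up in `workers_to_alert` after a handful of runs, without anyone having to watch the jobs.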
After diagnosing it, it turns out the problem was that we'd slowly been
leaking nova keypairs, and had hit our quota limit. This was easy to fix (I
deleted the unused keypairs), but it got me thinking...
Second Thought: We should be monitoring everything where there's a hard
quota in place. We could easily track 'num keypairs left before the world
ends', and alert if that got below `N`.
To my mind, we monitor disk space, and keypairs are a similar category:
* We have a known hard quota.
* It's a pretty catastrophic failure when we run out.
* Measuring how many you have left is reasonably trivial.
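The 'alert if it gets below `N`' check could look something like the sketch below. It assumes the used and limit figures for each resource have already been fetched (from nova or wherever the quota lives); that fetching step is left out, and the resource names and thresholds are made up for illustration.

```python
def check_quotas(usage, min_left):
    """Return alert messages for resources that are close to their quota.

    usage:    {resource: (used, limit)}
    min_left: {resource: N}, alert when fewer than N units remain.
              Resources missing from min_left are never alerted on.
    """
    alerts = []
    for resource, (used, limit) in sorted(usage.items()):
        left = limit - used
        if left < min_left.get(resource, 0):
            alerts.append(f"{resource}: only {left} of {limit} left")
    return alerts


if __name__ == "__main__":
    # Hypothetical numbers, not taken from any real tenant.
    usage = {"keypairs": (98, 100), "disk_gb": (200, 1000)}
    for message in check_quotas(usage, {"keypairs": 10, "disk_gb": 100}):
        print(message)
```

Treating keypairs and disk space through the same function reflects the point above: anything with a hard limit and a cheap way to count usage fits the same shape.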
If I may be permitted to jump into full engineering-implementation mode for
a second....
Imagine a generic service that simply exposes statistics for the OpenStack
tenant it's deployed in. We could then deploy one in every tenant we use,
and have the stats monitoring system read those....
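To make the idea a bit more concrete, here is a minimal sketch of such a service using only the Python standard library. Everything specific in it is an assumption: the `/stats` path, the JSON layout, and the `collect_stats` callable, which stands in for whatever code would actually query the tenant.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(collect_stats):
    """Build a request handler that serves collect_stats() as JSON."""

    class StatsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/stats":  # hypothetical endpoint name
                self.send_error(404)
                return
            body = json.dumps(collect_stats()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet; a real service would log

    return StatsHandler


def make_server(collect_stats, host="127.0.0.1", port=0):
    """Create (but do not start) the stats server; port=0 picks a free port."""
    return HTTPServer((host, port), make_handler(collect_stats))


if __name__ == "__main__":
    # Placeholder stats; a real deployment would query the tenant here.
    server = make_server(lambda: {"keypairs": {"used": 3, "limit": 100}}, port=8080)
    server.serve_forever()
```

Keeping the collection logic behind a plain callable means the same service could be deployed unchanged in each tenant, with the monitoring system polling `/stats` on every instance.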
Anyway, I thought that was an interesting experience :D
--
Thomi Richards
thomi.richards@xxxxxxxxxxxxx