
canonical-ci-engineering team mailing list archive

Re: proposal for next sprint

 

Hi guys,

Let me add some extra information to this thread that might help us overcome
some of the obstacles to getting better visibility into our systems.

The pattern we have established for pushing rich events via python-logstash
has been serving us very well, and I'd be wary of initiatives to retrofit
working services with anything else without identifying exactly what we are
missing with this approach. Especially because we can leverage logstash's
capabilities to route events conditionally to other systems: statsd/graphite,
IRC and PagerDuty [1].
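
For reference, the event side of that pattern is tiny. A minimal sketch below
(host, port and field names are placeholders, not our actual config):

    import logging
    import logstash

    logger = logging.getLogger('uci-worker')
    logger.setLevel(logging.INFO)
    # Placeholder endpoint; the real host/port live in the service config.
    logger.addHandler(logstash.TCPLogstashHandler('logstash.internal', 5959,
                                                  version=1))

    # The extra fields end up as fields on the logstash event, so outputs
    # can match on them conditionally (statsd, IRC, PagerDuty, ...).
    logger.info('image build finished', extra={
        'unit': 'cloud-worker/3',
        'ticket': 'ticket-1234',
        'duration': 42.7,
    })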

There are undeniable visualisation limitations with kibana3, and that is
probably the motivation for thinking about Graphite and Prometheus. However,
frankly speaking, they look pretty much the same in terms of the features
they provide, and they would require deploying (and maintaining) extra
infrastructure just to provide better visualisation of data we already have
(and can easily augment, if necessary). If the problem is indeed only
visualisation, let's evaluate kibana4 [2], which would provide a much
smoother migration path.

Moreover, I feel that we are only scratching the surface of what ELK can do
to answer our current problems, and the idea that Graphite would give us
free-(lunch-)metrics is not entirely true: the metrics still have to be
built/modelled in the services and in Graphite, i.e. it's more about figuring
out what we want to see than how to do it.

Let's talk about the problems we are trying to solve with metrics ... From
what we have already experienced and you have reported, we would like to
1) visualise *some* (still not clear to me which) performance/duration data
and 2) be alerted about misbehaving/malfunctioning units.

First, let's agree that they are distinct problems.

Performance visualisation on heterogeneous tasks is a complex problem
regardless of the tool (kibana, graphite or prometheus). Even if we push
individual step durations (extra['duration']) on events, I am struggling to
see how we could make much sense of that data as a periodic series without
being restricted to filtering individual sources (and even then it would be
skewed by the test suite growing or shrinking). Anyway, it would be much
cheaper to push extra data on the existing events and see how it could be
combined/visualised in kibana; maybe that's the most efficient and useful
experiment/spike we could run at this point.
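
To make that concrete, "pushing extra data on the existing events" could be
as small as a timing helper around each step. A rough sketch (the helper and
field names are hypothetical, just to illustrate the shape):

    import logging
    import time

    def timed_step(logger, name, func, **extra):
        # Run one step and attach its duration (plus any extra fields) to
        # the event we already emit, so kibana can filter/aggregate on them.
        start = time.time()
        try:
            return func()
        finally:
            extra.update(step=name, duration=time.time() - start)
            logger.info('step finished', extra=extra)

    # e.g. timed_step(logging.getLogger('uci-worker'), 'unpack', unpack_source)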

Alerting is something we are completely missing: we depend on someone to
access kibana, interpret the graphs and act if necessary, so problems go
unnoticed all the time. I personally think this is a much more pressing issue
to tackle and, as pointed out above, it does not depend on any new
infrastructure, just on extending the LS configuration.

Despite the umbrella-check-retry done in result-checker, I think we are
interested in alerts for *all* ERROR events from units. We could get those
via IRC, PD or email (I think we should decide which medium suits us best
during the spike-story). This way the vanguard would be alerted and could act
upon any of the following (see the sketch after this list):

 * Spurious failures (e.g. glance client cached connection timeout ->
should be fixed)
 * Unit failures: even if a failure was retried locally or by the
result-checker (not visible to users), it is still either a problem to be
fixed in code (blocking new deployment promotion) or one of the myriad of
possible environment problems that could prevent a worker from delivering
its results (more on this below)
 * Ultimately, a deadletter-ed message, which would be a problem visible to
users
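
On the event side this needs nothing new, only enough structured fields on
the ERRORs for LS to filter on; the conditional routing to IRC/PD/email
would live entirely in the logstash outputs [1]. A rough sketch (field names
are assumptions):

    import logging
    import logstash

    logger = logging.getLogger('cloud-worker')
    # Same handler setup as in the earlier sketch.
    logger.addHandler(logstash.TCPLogstashHandler('logstash.internal', 5959,
                                                  version=1))

    # One ERROR per failure, structured enough for LS to route conditionally.
    logger.error('unit failure', extra={
        'unit': 'cloud-worker/3',
        'exit_code': 16,
        'retried': True,        # retried locally or by the result-checker
        'deadlettered': False,  # True would mean a user-visible problem
    })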

How does that sound? First we become aware of the problems in an active way,
then with that data we decide how we can proactively prevent them.

This task looks small and objective enough to fit in a spike-story and
would move us consistently forward on this subject.

Let me briefly give you my take on this wish to monitor *everything* in the
hope it will someday matter, like tenant quota usage or individual unit
raw data (disk, cpu, mem, etc). It comes from the Newtonian root-cause
analysis mindset we were taught our entire lives [3], but that simplification
becomes suboptimal for complex systems, where solving the 'root cause' often
uncovers new problems with new effects that were not monitored before. In
other words, monitoring all the possible cause-effect combinations becomes
expensive because of their unpredictable relationships; we will never monitor
enough to prevent problems.

A practical example is the keypair leak from uci-nova when the port quota is
exhausted. While monitoring the keypair quota looks useful for identifying
that there is a leak, unfortunately it would not point us to the *real* cause
of the problem; we would still need a human to interpret the results and
decide how to sort it out, and meanwhile the problem would escalate and end
up affecting service availability.

For instance, instead of passively trying to collect isolated data and
hoping a human will show up to sort things out quickly, we could simply
kill/stop cloud-worker units that exited with exit_code 16. That would
arguably decrease system throughput, but it would contain the damage without
exposing unavailability to users while we analyse and solve the problem.
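
A rough sketch of what that damage control could look like, assuming we drive
it from the operator box and that stopping the unit through juju is an
acceptable mechanism (both are assumptions; the point is the shape, not the
exact mechanism):

    import subprocess

    EXIT_CODE_QUARANTINE = 16  # workers exiting with this code leave rotation

    def quarantine(unit, exit_code):
        """Stop a misbehaving worker unit instead of letting it keep failing."""
        if exit_code != EXIT_CODE_QUARANTINE:
            return
        # Assumes juju is available here; any other way of taking the unit
        # out of rotation (e.g. pausing its service) would do just as well.
        subprocess.check_call(['juju', 'remove-unit', unit])

    # e.g. quarantine('cloud-worker/3', 16)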

This is just one example of how I think we should operate systems with this
level of complexity: instead of trying to model complex and unpredictable
cause-effect pairs, we buy time to perform deep analysis and work on fixes by
isolating/removing problematic units ...

Err ... this message ended up longer than I expected; sorry for the lack of
focus here.

[1] http://logstash.net/docs/1.4.2/outputs/{statsd, irc, pagerduty}
[2] https://www.elastic.co/blog/kibana-4-literally
[3] “For every action there is an equal and opposite reaction.” Therefore
every cause has an effect and every effect has a cause

On Thu, May 7, 2015 at 6:47 AM, psivaa <para.siva@xxxxxxxxxxxxx> wrote:

>
>
> On 07/05/15 05:53, Thomi Richards wrote:
>
> A small anecdote from my afternoon, that taught me a few lessons:
>
>
> Today workers started failing all tests. First it was one worker, then two
> workers, then a bit later it was three workers. I happened to be looking at
> the jenkins jobs, and caught the error message as it went past. However,
> that was pure luck. I could easily imagine a scenario where no one was
> looking at the jenkins jobs, and it might take 24 hours to realise that
> proposed-migration was horribly backed up.
>
> First thought: We might have seen it if we were tracking 'pass rate per
> worker' - we'd have seen one worker spike to 100% fail rate. We could alert
> on that, even.
>
> After diagnosing it, it turns out the problem was that we'd slowly been
> leaking nova keypairs, and had hit our quota limit. This was easy to fix (I
> deleted the unused keypairs), but it got me thinking...
>
> Second Thought: We should be monitoring everything where there's a hard
> quota in place. We could easily track 'num keypairs left before the world
> ends', and alert if that got below `N`.
>
> To my mind, we monitor disk space, and keypairs are a similar category:
>  * We have a known hard quota.
>  * It's a pretty catastrophic failure when we run out.
>  * Measuring how many you have left is reasonably trivial.
>
>  Although our first priority is to stop any leaks (which iirc we did for secgroups already due to some other reasons), I like this idea of monitoring and alerting when the stock goes low.
> It may seem that for certain solutions, e.g. bbb, we may not see similar types of issues, but it's always best to monitor them, like a stock control system.
>
> Also, we could use this collated data to back up any quota increase requests, mentioned somewhere else in this thread.
>
>
>
> If I may be permitted to jump into full engineering-implementation mode for
> a second....
>
>
> Imagine a generic service that simply exposes statistics for the openstack
> tenant it's deployed in? We could then deploy one in every tenant we use,
> and have the stats monitoring system read those....
>
>
> Anyway, I thought that was an interesting experience :D


-- 
Celso Providelo
celso.providelo@xxxxxxxxxxxxx
