Re: proposal for next sprint
On Thu, May 7, 2015 at 11:02 AM, Ursula Junque <ursula.junque@xxxxxxxxxxxxx>
wrote:
> Hi,
>
> On Thu, May 7, 2015 at 12:14 PM, Celso Providelo <
> celso.providelo@xxxxxxxxxxxxx> wrote:
>
>> Hi guys,
>>
>> Let me add some extra information to this thread that might help us
>> overcome some obstacles to getting better visibility into our systems.
>>
>> The pattern we have established for pushing rich events via
>> python-logstash has been serving us very well, and I'd be wary of initiatives
>> to retrofit working services with anything else without identifying exactly
>> what we are missing with this approach. Especially because we can leverage
>> logstash's ability to route events conditionally to other systems
>> (statsd/graphite, IRC and PagerDuty) [1].
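Agreed that the pattern works well. For anyone newer to it, it's roughly this
shape (a minimal sketch only; the host, port and field names below are
illustrative, not our real config):

    import logging
    import logstash

    logger = logging.getLogger('cloud-worker')
    logger.setLevel(logging.INFO)
    # UDP handler from python-logstash; a TCPLogstashHandler also exists.
    logger.addHandler(logstash.LogstashHandler('logstash.internal', 5959, version=1))

    # A "rich event": everything in extra becomes a field logstash can
    # filter on and route conditionally (statsd/graphite, IRC, PagerDuty).
    logger.info('image build finished', extra={
        'component': 'cloud-worker',
        'duration': 42.7,
        'result': 'PASS',
    })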
>>
>> There are undeniable visualisation limitations with kibana3, and that's
>> probably the motivation for thinking about Graphite and Prometheus.
>> However, frankly speaking, they look pretty much the same in terms of the
>> features they provide, and they would require extra infrastructure
>> deployment (and maintenance) to provide better visualisation of data we
>> already have (and can easily augment, if necessary). If the problem is
>> indeed only visualisation, let's evaluate kibana4 [2], which would provide
>> a much smoother migration path.
>>
>
I wasn't aware kibana4 had these capabilities, thanks for providing the
insight. I find there are several aspects of kibana that I'm unaware of. I
get that it is a data aggregator, but I feel I'm only scratching the
surface of what I could be doing with it. I'm very interested in any
service that lets me push gobs of blobs of data to it so that I can then
pick and choose what I do with it later.
>> Moreover, I feel that we are only scratching the surface of ELK in terms
>> of capabilities to provide answers to our current problems, and the idea
>> that Graphite would give us free-(lunch-)metrics is not entirely true: the
>> metrics still have to be built/modelled in the services and in Graphite,
>> i.e. it's more about figuring out what we want to see than how to do it.
>>
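To make that point concrete: even if we went the Graphite route, each service
would still need instrumentation along these lines (a sketch only; the statsd
endpoint, the metric names and run_test_job() are made up):

    import time
    import statsd  # assumes a statsd daemon sitting in front of Graphite

    stats = statsd.StatsClient('statsd.internal', 8125, prefix='ci.cloud-worker')

    def run_test_job():
        time.sleep(0.1)  # stand-in for whatever the service actually does

    start = time.time()
    run_test_job()
    stats.timing('job.duration', int((time.time() - start) * 1000))  # in ms
    stats.incr('job.completed')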
>
>
> Just my two cents: I agree it's fundamental to understand the problems
> we're trying to solve before diving into solution details, and I believe
> that's the tiny missing bit in this discussion. That said, I think
> situations like this are the right moments to look into different
> technologies. For example: if the issue now is indeed visualization, we
> don't necessarily have to be limited to kibana 4; we can use spikes to
> investigate it along with the alternatives we want to evaluate, like the
> ones suggested in this thread. Timeboxed efforts ftw. :)
>
I think this thread is getting us to the point of understanding the
problems or at least bringing them to the surface. We had to start
somewhere.
>
>> Let's talk about the problems we are trying to solve with metrics ...
>> From what we have already experienced and you have reported, we would like
>> to 1) visualise *some* (not clear to me yet) performance/duration
>> iterations and also 2) be alerted about misbehaving/malfunctioning units.
>>
>> First, let's agree that they are distinct problems.
>>
>
Yep.
>> Performance visualisation on heterogeneous tasks is a complex problem,
>> regardless of the tool (kibana, graphite or prometheus). Even if we push
>> individual step durations (extra['duration']) on events, I am struggling to
>> see how we could make much sense of these data as a periodic series
>> without being restricted to filtering individual sources (even though it
>> would be tied to the increase/shrink of tests). Anyway, it would be much
>> cheaper to push extra data in the existing events and see how they could be
>> combined/visualised in kibana, and maybe that's the most efficient and
>> useful experiment/spike we could run at this point.
>>
>
Indeed, monitoring something like average test time for packages going
through proposed migration is meaningless to me as well, but using the same
data to determine the percentage of time that a worker is busy is
meaningful. That's the kind of data I would want to know before changing
the scaling of a service. Perhaps solving this kind of problem is not a
priority right now, and that's ok. But when the time comes, I sure would
like to be able to add that metric quickly if I didn't already have it.
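For the record, the calculation I have in mind is as simple as this (a sketch
reusing the per-job duration we already push; field names are illustrative):

    # Estimate how busy a worker was over a window from its job durations.
    def busy_percentage(events, window_seconds):
        """events: iterable of dicts carrying a 'duration' in seconds."""
        busy = sum(e['duration'] for e in events)
        return 100.0 * busy / window_seconds

    # e.g. jobs one cloud-worker reported over the last hour
    jobs = [{'duration': 310.0}, {'duration': 450.5}, {'duration': 1200.0}]
    print('%.1f%% busy' % busy_percentage(jobs, 3600))  # ~54.5% busy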
>> Alerting is something we are completely missing; we depend on someone to
>> access kibana, interpret the graphs and act if needed. So problems
>> go unnoticed every time. I, personally, think this is a much more
>> pressing issue to be tackled and, as pointed out above, it does not depend
>> on any new infrastructure, just extending the LS configuration.
>>
>> Despite the umbrella-check-retry done in result-checker, I think we
>> are interested in alerts for *all* ERROR events from units. We could get
>> those via IRC, PD or email (I think we should decide which medium suits
>> us best during the spike-story). This way the vanguard person would be
>> alerted and could act upon any:
>>
>> * Spurious failures (e.g. glance client cached connection timeout ->
>> should be fixed)
>> * Unit failure; even if it was retried locally or by the result-checker
>> (not visible to users), it is still a problem to be fixed in code (block
>> new deployment promotion) or one of the myriad of possible environment
>> problems that could prevent a worker from delivering its results (more on
>> this below)
>> * Ultimately, a deadletter-ed message, which would be a problem visible
>> to users
>>
>> How does that sound? First we become aware of the problems in an active
>> way, then with that data we decide how we can proactively prevent them.
>>
>> This task looks small and objective enough to fit in a spike-story and
>> would move us consistently forward on this subject.
>>
>
I really think alerting is a completely orthogonal topic. It's a good
topic, but it does deserve its own spike and its own priority.
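When that spike happens, one possible shape for it (besides just extending
the LS config, which may well be simpler) is a small poller against
Elasticsearch; the index name, field names and the notify step below are all
assumptions on my part:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://elasticsearch.internal:9200'])

    # Look for ERROR-level events from the last five minutes and surface
    # them to the vanguard (print here; IRC/PagerDuty/email in practice).
    resp = es.search(index='logstash-*', body={
        'query': {'bool': {'must': [
            {'match': {'level': 'ERROR'}},
            {'range': {'@timestamp': {'gte': 'now-5m'}}},
        ]}}
    })

    for hit in resp['hits']['hits']:
        print('ALERT: %s' % hit['_source'].get('message'))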
>
>> Let me briefly give you my take on this wish of monitoring *everything*,
>> hoping it will someday matter: tenant quota usage, individual unit
>> raw data (disk, cpu, mem, etc). This comes from the Newtonian Root-Cause
>> Analysis mindset we were taught our entire lives [3], but that
>> simplification becomes suboptimal for complex systems, where solving the
>> 'root cause' often uncovers new problems with new effects not monitored
>> before. Monitoring all the possible cause-effect combinations becomes
>> expensive because of their unpredictable relationships; we will never
>> monitor enough to prevent problems.
>>
>
I get that it is futile to monitor everything. But at the same time I don't
understand why it's pointless to monitor anything before there is a
problem. We built these systems; we have intuition about where they are
inefficient and where our customers will have complaints ("Why isn't it
faster?"). In my head, there are already problems we are aware of, we just
don't know to what degree they are a problem. The fact that these are
distributed systems doesn't change that for me. This is my view of (1) above.
If your argument is that we're better off investing effort in
stopping/solving the underlying problems, which I consider topic (2), ...
(read on)
>> A practical example is the keypair leaking from uci-nova when the port
>> quota is exhausted. While monitoring keypair quota looks useful for
>> identifying that there is a leak, unfortunately it would not point us to
>> the *real* cause of the problem; we would still need a human to interpret
>> the results and decide how to sort it out, and meanwhile the problem would
>> escalate and end up affecting service availability.
>>
>> For instance, instead of passively trying to collect isolated data and
>> hoping a human will show up to sort it out quickly, we could simply
>> kill/stop cloud-worker units that resulted in exit_code 16. That would
>> arguably decrease system throughput, but it would contain the damage
>> without exposing unavailability to users while we analyse and solve the
>> problem.
>>
>> This is just one example of how I think we should operate systems with
>> this level of complexity: instead of trying to model complex and
>> unpredictable cause-effect pairs, we buy time to perform deep analysis
>> and work on fixes by isolating/removing problematic units ...
>>
>
... I'm fully behind this idea. What is preventing us from implementing
these circuit breakers? What do we need to finish around the external
pieces of the problem and connect errors to alerts? Can we develop our next
set of services to 'self-destruct' when they hit one of these errors? Do we
need a spike story here to do anything or are there some well understood
improvements we could implement right away? I think this is fundamentally a
distinct problem from "real-time stats monitoring" which is where this
thread started. I think both are important, but I'm a little worn out from
writing this to think about which one is more important at the moment :-).
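To at least sketch the circuit breaker I'm imagining (how we observe the
exit code, the unit naming and notify_vanguard() are all assumptions, not
something we have today):

    import subprocess

    QUARANTINE_EXIT_CODES = {16}  # e.g. the uci-nova keypair/port-quota case

    def notify_vanguard(message):
        # Placeholder: in practice this would be IRC/PagerDuty/email.
        print('VANGUARD: %s' % message)

    def maybe_quarantine(event):
        # Take a misbehaving cloud-worker unit out of rotation instead of
        # letting it keep failing and degrade the service for users.
        if event.get('exit_code') in QUARANTINE_EXIT_CODES:
            unit = event['unit']  # e.g. 'cloud-worker/3'
            subprocess.check_call(['juju', 'remove-unit', unit])
            notify_vanguard('quarantined %s after exit_code %s'
                            % (unit, event['exit_code']))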
Francis
--
Francis Ginther
Canonical - Ubuntu Engineering - Continuous Integration Team