
canonical-ci-engineering team mailing list archive

Re: proposal for next sprint

 

Hi,

On Thu, May 7, 2015 at 12:14 PM, Celso Providelo <
celso.providelo@xxxxxxxxxxxxx> wrote:

> Hi guys,
>
> Let me add some extra information to this thread that might help us
> overcome some obstacles to achieving better visibility into our systems.
>
> The pattern we have established for pushing rich events via
> python-logstash has been serving us very well, so I'd be wary of initiatives
> to retrofit working services with anything else without identifying exactly
> what we are missing with this approach, especially because we can leverage
> logstash's capabilities to conditionally route events to other systems such
> as statsd/graphite, IRC and PagerDuty [1].
>
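
For reference, a minimal sketch of that rich-event pattern with
python-logstash (the host, port and field names below are made up, and the
conditional routing to statsd/IRC/PagerDuty would still happen on the
logstash side, as in [1]):

    import logging
    import logstash

    logger = logging.getLogger('uci-worker')
    logger.setLevel(logging.INFO)
    # version=1 selects the logstash v1 JSON event format.
    logger.addHandler(
        logstash.TCPLogstashHandler('logstash.internal', 5959, version=1))

    # Anything passed in 'extra' ends up as fields on the event, so logstash
    # can filter and route on them downstream.
    logger.info('test run finished', extra={
        'ticket_id': 'hypothetical-123',
        'exit_code': 0,
        'duration': 42.7,
    })
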
> There are undeniable visualisation limitations with kibana3, and that's
> presumably the motivation for thinking about Graphite and Prometheus.
> However, frankly speaking, they look pretty much the same in terms of the
> features they provide, and they would require extra infrastructure
> deployment (and maintenance) to give us better visualisation of the data we
> already have (and can easily augment, if necessary). If the problem is
> indeed only visualisation, let's evaluate kibana4 [2], which would provide
> a much smoother migration path.
>

> Moreover, I feel that we are only scratching the surface of ELK in terms
> of its capability to provide answers to our current problems, and this idea
> that Graphite would give us free-(lunch-)metrics is not entirely true: they
> still have to be built/modelled in the services and in Graphite, i.e. it's
> more about figuring out what we want to see than how to do it.
>


Just my two cents: I agree it's fundamental to understand the problems
we're trying to solve before diving into solution details, and I believe
that's the tiny missing bit in this discussion. That said, I think
situations like this are the right moments to look into different
technologies. For example: if the issue now is indeed visualization, we
don't necessarily have to limit ourselves to kibana 4; we can use spikes to
investigate it along with the other alternatives we want to evaluate, like
the ones suggested in this thread. Timeboxed efforts ftw. :)
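
To make the point above concrete that Graphite wouldn't be free-lunch
metrics: even with statsd in the middle, each service still has to emit
every metric explicitly, roughly like this sketch (the client host, prefix
and metric names are placeholders):

    from statsd import StatsClient  # the 'statsd' package on PyPI

    stats = StatsClient(host='graphite.internal', port=8125, prefix='uci')

    # Nothing appears in Graphite until the service sends it explicitly.
    stats.incr('worker.jobs.completed')
    stats.timing('worker.job.duration', 1532)  # milliseconds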



> Let's talk about the problems we are trying to solve with metrics... From
> what we have already experienced and you have reported, we would like to 1)
> visualise *some* (it's not yet clear to me which) performance/duration
> iterations and also 2) be alerted about misbehaving/malfunctioning units.
>
> First, let's agree that they are distinct problems.
>
> Performance visualisation on heterogeneous tasks is a complex problem
> regardless of the tool (kibana, graphite or prometheus). Even if we push
> individual step durations (extra['duration']) on events, I am struggling to
> see how we could make a lot of sense of these data as a periodic series
> without being restricted to filtering individual sources (even though that
> would be tied to the growth/shrinkage of the test set). Anyway, it would be
> much cheaper to push extra data in the existing events and see how they
> could be combined/visualised in kibana; maybe that's the most efficient and
> useful experiment/spike we could do at this point.
>
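
The cheap version of that spike could be as small as a helper around the
existing logger, something like this sketch (the logger name and step names
are invented, and the logger is assumed to already carry a logstash handler):

    import time
    import logging
    from contextlib import contextmanager

    logger = logging.getLogger('uci-worker')

    @contextmanager
    def timed_step(step):
        """Push the wall-clock duration of a step as an extra field."""
        start = time.time()
        try:
            yield
        finally:
            logger.info('step finished', extra={
                'step': step,
                'duration': time.time() - start,
            })

    with timed_step('provision-instance'):
        pass  # the actual work goes here
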
> Alerting is something we are completely missing; we depend on someone to
> access kibana, interpret the graphs and act if needed, so problems go
> unnoticed every time. I personally think this is a much more pressing issue
> to be tackled and, as pointed out above, it does not depend on any new
> infrastructure, just on extending the LS configuration.
>
> Despite the umbrella-check-retry done in result-checker, I think we are
> interested in alerts for *all* ERROR events from units. We could get those
> via IRC, PD or email (I think we should decide which medium suits us
> better during the spike-story). This way the vanguard person would be
> alerted and act upon any:
>
>  * Spurious failures (e.g. glance client cached connection timeout ->
> should be fixed)
>  * Unit failure; even if it was retried locally or by the result-checker
> (not visible to users), it is still a problem to be fixed in code (blocking
> new deployment promotion) or one of the myriad of possible environment
> problems that could prevent a worker from delivering its results (more on
> this below)
>  * Ultimately a deadletter-ed message, which would be a problem visible
> to users
>
> How does that sound? First we become aware of the problems in an active
> way, then with that data we decide how we can proactively prevent them.
>
> This task looks small and objective enough to fit in a spike-story and
> would move us consistently forward on this subject.
>
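
Until the LS configuration grows those IRC/PD/email outputs, one cheap way
to prototype the same ERROR alert is a small cron job querying the
Elasticsearch indices we already have; the endpoint, index pattern and
field names below are assumptions about our logstash mapping:

    from elasticsearch import Elasticsearch  # the 'elasticsearch' package on PyPI

    es = Elasticsearch(['http://elasticsearch.internal:9200'])

    result = es.count(index='logstash-*', body={
        'query': {
            'bool': {
                'must': [
                    # The level field name depends on how python-logstash
                    # maps log records in our setup.
                    {'term': {'level': 'ERROR'}},
                    {'range': {'@timestamp': {'gte': 'now-10m'}}},
                ]
            }
        }
    })
    if result['count'] > 0:
        # Swap the print for whichever medium the spike settles on.
        print('vanguard: %d ERROR events in the last 10 minutes'
              % result['count'])
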
> Let me briefly give you my take on this wish to monitor *everything*
> hoping it will someday matter, like tenant quota usage or individual unit
> raw data (disk, cpu, mem, etc). This comes from the Newtonian Root-Cause
> Analysis mindset we were taught our entire lives [3], but that
> simplification becomes suboptimal for complex systems, where solving the
> 'root cause' often uncovers new problems with new effects not monitored
> before. In other words, monitoring all the possible cause-effect
> combinations becomes expensive because of their unpredictable
> relationships; we will never monitor enough to prevent problems.
>
> A practical example is the keypair leak from uci-nova when the port quota
> is exhausted. While monitoring keypair quota looks useful for identifying
> that there is a leak, unfortunately it would not point us to the *real*
> cause of the problem; we would still need a human to interpret the results
> and decide how to sort it out, and meanwhile the problem would escalate and
> end up affecting service availability.
>
> For instance, instead of passively trying to collect isolated data and
> hoping a human would show up to sort it out quickly, we could simply
> kill/stop cloud-worker units that resulted in exit_code 16. That would
> arguably decrease system throughput, but it would contain the damage
> without exposing unavailability to users while we analyse and solve the
> problem.
>
> This is just one example of how I think we should operate systems with
> this level of complexity: instead of trying to model complex and
> unpredictable cause-effect pairs, we buy time to perform deep analysis
> and work on fixes by isolating/removing problematic units...
>
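
A hypothetical sketch of that containment step, assuming the cloud-workers
are juju units and that removing a unit is an acceptable way to take it out
of rotation while we investigate:

    import subprocess

    FATAL_EXIT_CODE = 16  # the exit code mentioned above for a worker that
                          # cannot deliver its results

    def quarantine_worker(unit_name, exit_code):
        """Stop a misbehaving cloud-worker unit instead of waiting for a human."""
        if exit_code == FATAL_EXIT_CODE:
            subprocess.check_call(['juju', 'remove-unit', unit_name])

    # e.g. quarantine_worker('cloud-worker/3', 16)
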
> Err... this message ended up longer than I expected; sorry for not being
> more concise.
>
> [1] http://logstash.net/docs/1.4.2/outputs/{statsd, irc, pagerduty}
> [2] https://www.elastic.co/blog/kibana-4-literally
> [3] “For every action there is an equal and opposite reaction.” Therefore
> every cause has an effect and every effect has a cause
>
> On Thu, May 7, 2015 at 6:47 AM, psivaa <para.siva@xxxxxxxxxxxxx> wrote:
>
>>
>>
>> On 07/05/15 05:53, Thomi Richards wrote:
>>
>> A small anecdote from my afternoon that taught me a few lessons:
>>
>>
>> Today workers started failing all tests. First it was one worker, then two
>> workers, then a bit later it was three workers. I happened to be looking at
>> the jenkins jobs, and caught the error message as it went past. However,
>> that was pure luck. I could easily imagine a scenario where no one was
>> looking at the jenkins jobs, and it might take 24 hours to realise that
>> proposed-migration was horribly backed up.
>>
>> First thought: We might have seen it if we were tracking 'pass rate per
>> worker' - we'd have seen one worker spike to 100% fail rate. We could alert
>> on that, even.
>>
>> After diagnosing it, it turns out the problem was that we'd slowly been
>> leaking nova keypairs, and had hit our quota limit. This was easy to fix (I
>> deleted the unused keypairs), but it got me thinking...
>>
>> Second Thought: We should be monitoring everything where there's a hard
>> quota in place. We could easily track 'num keypairs left before the world
>> ends', and alert if that got below `N`.
>>
>> To my mind, we monitor disk space, and keypairs are a similar category:
>>  * We have a known hard quota.
>>  * It's a pretty catastrophic failure when we run out.
>>  * Measuring how many you have left is reasonably trivial.
>>
>>  Although our first priority is to stop any leaks (which iirc we did for secgroups already due to some other reasons), I like this idea of monitoring and alerting when the stock runs low.
>> It may seem that for certain solutions, e.g. bbb, we may not see similar types of issues, but it's always best to monitor them, like a stock control system.
>>
>> Also, we could use this collated data to back up any quota increase requests, as mentioned elsewhere in this thread.
>>
>>
>>  If I may be permitted to jump into full engineering-implementation mode for
>> a second....
>>
>>
>> Imagine a generic service that simply exposes statistics for the openstack
>> tenant it's deployed in. We could then deploy one in every tenant we use,
>> and have the stats monitoring system read those....
>>
>>
>> Anyway, I thought that was an interesting experience :D
>>
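
On the quota monitoring and per-tenant stats service ideas above: the check
itself is indeed trivial. A rough sketch with python-novaclient, where the
credentials, threshold and alerting medium are all placeholders:

    import os
    from novaclient import client as nova_client

    # Credentials taken from the usual OS_* environment variables (placeholders).
    nova = nova_client.Client(
        '2',
        os.environ['OS_USERNAME'],
        os.environ['OS_PASSWORD'],
        os.environ['OS_TENANT_NAME'],
        os.environ['OS_AUTH_URL'],
    )

    used = len(nova.keypairs.list())
    limit = nova.quotas.get(os.environ['OS_TENANT_ID']).key_pairs
    remaining = limit - used

    HEADROOM = 10  # arbitrary threshold; tune per quota
    if remaining < HEADROOM:
        # Wire this into whatever alerting medium the spike settles on.
        print('WARNING: only %d keypairs left before the world ends' % remaining)
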
>>
>>
>>
>>
>>
>>
>
>
> --
> Celso Providelo
> celso.providelo@xxxxxxxxxxxxx
>
>


-- 
Úrsula Junque
Ubuntu CI Engineer

ursula.junque@xxxxxxxxxxxxx
ursinha@xxxxxxxxxxx
ursinha@xxxxxxxxxx

Ubuntu - "I am what I am because of who we all are."
Linux user #289453 - Ubuntu user #31144
