
Re: proposal for next sprint

 

Hi


A heavily trimmed reply is inline below:

On Fri, May 8, 2015 at 7:39 AM, Francis Ginther <
francis.ginther@xxxxxxxxxxxxx> wrote:
>
>
>> On Thu, May 7, 2015 at 12:14 PM, Celso Providelo <
>> celso.providelo@xxxxxxxxxxxxx> wrote:
>>
>>>
>>> There are undeniable visualisation limitations with kibana3, and that's
>>> the likely motivation for considering Graphite and Prometheus. However,
>>> frankly speaking, they look pretty much the same in terms of the features
>>> they provide, and they would require extra infrastructure deployment (and
>>> maintenance) to provide better visualisation of data we already have (and
>>> can easily augment, if necessary). If the problem is indeed only
>>> visualisation, let's evaluate kibana4 [2], which would provide a much
>>> smoother migration path.
>>>
>>
> I wasn't aware kibana4 had these capabilities, thanks for providing the
> insight. I find there are several aspects of kibana that I'm unaware of. I
> get that it is a data aggregator, but I feel I'm only scratching the
> surface of what I could be doing with it. I'm very interested in any
> service that lets me push gobs of blobs of data to it so that I can then
> pick and choose what I do with it later.
>

I also wasn't aware of kibana4.

I think we should evaluate it alongside all the other options according to
whatever acceptance criteria we decide. I notice that we're once again
drifting away from defining acceptance criteria :D

I don't think we should give it a higher priority just because we happen to
already be running kibana3 for logging. The only criterion should be "how
well does it meet our acceptance criteria" - it's not like deploying any of
the services mentioned will be particularly taxing...

We have acceptance criteria for logging and we *will have* acceptance
criteria for monitoring. If one system can provide the best of both worlds,
then great! If not, then we should always pick the right tool for the job.
After all, that's supposed to be one of the advantages of micro-services.


>
>>> Performance visualisation on heterogeneous tasks is a complex problem
>>> regardless of the tool (kibana, graphite or prometheus). Even if we push
>>> individual step durations (extra['duration']) on events, I am struggling
>>> to see how we could make a lot of sense of this data as a periodic series
>>> without being restricted to filtering individual sources (even though it
>>> would be tied to the increase/shrink of tests). Anyway, it would be much
>>> cheaper to push extra data in the existing events and see how it could be
>>> combined/visualised in kibana, and maybe that's the most efficient and
>>> useful experiment/spike we could do at this point.
>>>
>>
> Indeed, monitoring something like the average test time for packages going
> through proposed migration is meaningless to me as well, but using the same
> data to determine the percentage of time that a worker is busy is
> meaningful. That's the kind of data I would want to know before changing
> the scaling of a service. Perhaps solving this kind of problem is not a
> priority right now, and that's OK. But when the time comes, I sure would
> like to be able to add that metric quickly if I didn't already have it.
>

>

Indeed. If we have a system that meets the acceptance criterion "it is
simple to add and edit metrics (e.g. it does not require manual
configuration on the collection system)", then the cost of adding metrics
is low. I think there's a case to be made for tracking certain items from
the get-go, and adding others as and when we feel we need them.
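
Just to make that concrete, here's a rough sketch of what "cheap to add a
metric" could look like if our events end up in Elasticsearch (which is
what kibana reads). The index name, field names and the elasticsearch
client usage are my guesses for illustration, not a description of what
uci-engine actually does today:

# Sketch only: assumes events land in Elasticsearch and that we can attach
# arbitrary extra fields to them. Index and field names are made up.
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

def emit_event(phase, duration_seconds, worker_busy_fraction):
    """Push one event document; kibana can then slice on any field."""
    doc = {
        '@timestamp': datetime.utcnow().isoformat(),
        'phase': phase,                    # e.g. 'image_build'
        'extra': {
            'duration': duration_seconds,  # the extra['duration'] idea
            'worker_busy': worker_busy_fraction,
        },
    }
    es.index(index='uci-events-2015.05', doc_type='event', body=doc)

emit_event('image_build', 142.3, 0.8)

Adding a new metric is then just another key in the document - nothing to
reconfigure on the collection side.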


>
>>> This task looks small and objective enough to fit in a spike-story and
>>> would move us consistently forward on this subject.
>>>
>>
> I really think alerting is a completely orthogonal topic. It's a good
> topic, but it does deserve its own spike and its own priority.
>

I agree - alerting is super important, but I think it should be considered
separately, once we have the data we want to alert on. 1. Collect data. 2.
Alert based on data.
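
To illustrate step 2: once the data is being collected, the alerting side
could start as something as dumb as a cron job that reads the most recent
datapoint and complains when a threshold is crossed. A rough sketch only -
the query, field names and notify() are placeholders, not a real design:

# Sketch only: reads the latest keypair-usage event and checks a threshold.
from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])
THRESHOLD = 0.8  # complain when 80% of the keypair quota is used

def notify(message):
    # Placeholder: email, IRC bot, pager... whatever we pick later.
    print(message)

def check_keypair_usage():
    result = es.search(
        index='uci-events-*',
        body={
            'query': {'term': {'metric': 'keypair_quota_usage'}},
            'sort': [{'@timestamp': {'order': 'desc'}}],
            'size': 1,
        })
    hits = result['hits']['hits']
    if not hits:
        return
    usage = hits[0]['_source']['value']
    if usage >= THRESHOLD:
        notify('keypair quota at %.0f%% - probably leaking' % (usage * 100))

check_keypair_usage()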


>
>
>>
>>> Let me briefly give you my take on this wish to monitor *everything* in
>>> the hope that it will someday matter, like tenant quota usage and
>>> individual unit raw data (disk, cpu, mem, etc). This comes from the
>>> Newtonian Root-Cause Analysis mindset we were taught our entire lives
>>> [3], but this simplification becomes suboptimal for complex systems,
>>> where solving the 'root cause' often uncovers new problems with new
>>> effects not monitored before. In other words, monitoring all the possible
>>> cause-effect combinations becomes expensive because of their
>>> unpredictable relationships, and we will never monitor enough to prevent
>>> problems.
>>>
>>
> I get that it is futile to monitor everything. But at the same time I
> don't understand why it's pointless to monitor anything before there is
> a problem. We built these systems; we have intuition about where they are
> inefficient and where our customers will have complaints ("Why isn't it
> faster?"). In my head, there are already problems we are aware of, we just
> don't know to what degree they are a problem. The fact that these are
> distributed systems doesn't change that for me. This is my view of (1)
> above.
>
>
I get that root-cause analysis for complex systems is hard, but
"cause analysis" can be easy. In my example, the symptom was "OMG we're
running out of keypairs". The cause is "we're leaking keypairs". Who cares
about the root cause? There is a problem (running out of keypairs), and if
we don't fix it our system will break, and will break _hard_. Let's track
the things which we know will break our services if they ever happen.

We'll never be smart enough to predict all future sources of failure, but
we know from experience that there's a class of failure we can monitor and
alert on.
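
The keypair case is a good example of that class: a few lines run from cron
could count the keypairs, compare against the quota, and push the ratio as
an event like any other. Rough sketch only - I'm assuming python-novaclient
here and guessing at the quota field name:

# Sketch only: record keypair quota usage periodically so the leak in
# uci-nova shows up as a trend long before we actually run out.
import os
from novaclient import client as nova_client

nova = nova_client.Client(
    '2',
    os.environ['OS_USERNAME'],
    os.environ['OS_PASSWORD'],
    os.environ['OS_TENANT_NAME'],
    os.environ['OS_AUTH_URL'])

used = len(nova.keypairs.list())
# 'key_pairs' is my guess at the quota field; check against `nova quota-show`.
limit = nova.quotas.get(os.environ['OS_TENANT_ID']).key_pairs

print('keypair_quota_usage', float(used) / limit)
# ...then push that number as an event, however we end up pushing events.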



>>> A practical example is the keypair leak from uci-nova when the port
>>> quota is exhausted. While monitoring the keypair quota looks useful for
>>> identifying that there is a leak, unfortunately it would not point us to
>>> the *real* cause of the problem. We would still need a human to interpret
>>> results and decide how to sort it out, and meanwhile the problem would
>>> escalate and end up affecting service availability.
>>>
>>> For instance, instead of passively trying to collect isolated data and
>>> hoping a human will show up to sort it out quickly, we could simply
>>> kill/stop cloud-worker units that exit with exit_code 16. That would
>>> arguably decrease system throughput, but it would contain the damage
>>> without exposing unavailability to users while we analyse and solve the
>>> problem.
>>>
>>>
Yeah - circuit breakers would be wonderful, and I think we should consider
this as a separate investigation. However, you're going to need that data
to be collected and stored somehow before you can build something that
reads it. Maybe we look at auto-scaling and circuit-breaking (the two seem
entwined to my mind) as experiment N+1?
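
To sketch what I mean by entwined: both boil down to "read the collected
data, then act on a unit". Something like the toy example below, where
stop_unit() is a placeholder for whatever actually takes a cloud-worker out
of rotation - this is not a proposal for the real implementation:

# Sketch only: a trivial circuit breaker that takes a worker out of rotation
# after it reports exit_code 16 too many times in a row.
FAILURE_THRESHOLD = 3

class WorkerBreaker(object):

    def __init__(self, unit_name, stop_unit):
        self.unit_name = unit_name
        self.stop_unit = stop_unit   # callback that removes the unit
        self.consecutive_failures = 0
        self.open = False            # "open" = unit taken out of rotation

    def record_result(self, exit_code):
        if self.open:
            return
        if exit_code == 16:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_THRESHOLD:
                self.open = True
                self.stop_unit(self.unit_name)
        else:
            self.consecutive_failures = 0

def fake_stop_unit(unit_name):
    print('stopping %s before it does more damage' % unit_name)

breaker = WorkerBreaker('cloud-worker/3', fake_stop_unit)
for code in [0, 16, 16, 16]:
    breaker.record_result(code)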


Cheers,

-- 
Thomi Richards
thomi.richards@xxxxxxxxxxxxx
