canonical-ci-engineering team mailing list archive
Message #01091
Re: proposal for next sprint
On Wed, May 6, 2015 at 8:11 PM, Thomi Richards <thomi.richards@xxxxxxxxxxxxx> wrote:
> Hi,
>
> On Thu, May 7, 2015 at 9:28 AM, Paul Larson <paul.larson@xxxxxxxxxxxxx>
> wrote:
>
>>
>> I think we knew it would be important, but we didn't yet know what needed
>> to be measured.
>>
>
> Indeed. I don't mean to suggest that I (or anyone) can name all the stats
> we want to track for any new system we build - these things will evolve and
> change over time. Let's try and capture that in an acceptance criterion: It
> seems like we're saying that whatever we deploy must be able to pick up
> new/changed metrics easily... I'm thinking along the lines of "I just
> re-deploy the service(s) with the new stats enabled and it "just works" -
> no need to configure multiple things". So, as a first stab, how about some
> acceptance criteria like:
>
> * System must be able to respond to newly exposed metrics without
> requiring manual configuration?
>
> -or perhaps-
>
> * Must be able to track additional metrics, or change existing metrics
> without needing to change anything other than the service(s) affected?
>
Thanks for starting to define the criteria.
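To make the "just works" idea concrete, the property I'd want is roughly this
(a sketch only, using the Python prometheus_client library since Prometheus
comes up further down; the metric name is made up): the service declares and
exposes the new stat itself, and nothing on the collection side needs
reconfiguring when we redeploy.

    # Hypothetical sketch: a service adds a brand-new metric and exposes it
    # over HTTP; a pull-based collector picks it up on the next scrape with
    # no server-side configuration change.
    from prometheus_client import Counter, start_http_server

    TESTBED_RESETS = Counter(
        'uci_testbed_resets_total',          # invented metric name
        'Number of times the testbed was reset')

    start_http_server(8000)                  # collector scrapes /metrics here

    def reset_testbed():
        TESTBED_RESETS.inc()                 # the only change needed in the service
        # ... actual reset work ...

If adding a stat is a one-liner like that in the service, and everything
downstream picks it up automatically, I'd consider the criterion met.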
> ugh - writing these criteria is hard. Suggestions more than welcome - I
> think my suggestions above missed the mark somewhat :D
>
>
>
>> I'm not sure we still know *everything* that needs measurement, but it's
>> worth noting that many of the measurements that would have been really
>> useful were not so much about continuing operations, but about comparisons
>> between the old and new system. This should have been called out as a gap
>> in the acceptance criteria. I do agree that some operational measurements
>> in the future are useful too, but who monitors those? Do we alert on any of
>> them?
>>
>
> Agreed. How about:
>
> * The system must support alerting via PagerDuty?
>
I didn't consider alerting before, but it's an interesting idea.
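To make it concrete, paging could be as simple as whatever evaluates the alert
posting an event to PagerDuty's generic Events API (a sketch only; the
integration key, metric name, and threshold below are all placeholders):

    # Hypothetical sketch: open a PagerDuty incident when a queue metric
    # crosses a threshold. Assumes the 'requests' library and PagerDuty's
    # generic Events API; every key and value here is a placeholder.
    import requests

    PAGERDUTY_EVENTS_URL = (
        "https://events.pagerduty.com/generic/2010-04-15/create_event.json")

    def trigger_incident(service_key, description, details=None):
        payload = {
            "service_key": service_key,      # the service's integration key
            "event_type": "trigger",
            "description": description,
            "details": details or {},
        }
        resp = requests.post(PAGERDUTY_EVENTS_URL, json=payload)
        resp.raise_for_status()
        return resp.json()

    queue_depth = 1200                       # placeholder: would come from the metrics system
    if queue_depth > 1000:
        trigger_incident("<integration-key>",
                         "queue depth above threshold",
                         details={"queue": "main", "depth": queue_depth})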
>> The big ones that come to mind could mostly be solved by the queue stats
>> that Celso worked on
>>
>
> I'm not sure what that is?
>
>
>> I think unless we are called to revamp the solution later down the road
>> and need to understand the performance characteristics again. I think the
>> bigger hole for the moment is alerting, and having a good place to send
>> those with a clear path on how to resolve them. ex. deadletter queue flood,
>> big spike in queue depth, etc.
>>
>
> Agreed. I think though that we want a system that can track these things
> long term without any sort of cognitive overload. So, we track the
> performance metrics now (because we're handing the system over), but we
> also continue to track those metrics forever: we want a system that
> doesn't degrade as we track more and more things. We NEVER want to be in a
> situation where we're saying "OK, let's stop tracking X, Y, Z, to make room
> for these new metrics". For acceptance criteria, how about something
> like:
>
> * The system must be able to track metrics without displaying them.
> * The system must be able to track a large number of metrics without
> degrading performance (ugh - please help me re-write this).
>
>
>>
>>> I'd love to get some more information on ELK plugins. I don't have much
>>> experience with elasticsearch, and the little bit I tried to do (backing up
>>> and restoring elasticsearch when we migrated the elk deployment to
>>> production) proved to be tricky.
>>>
>> Unless we are collecting for a limited duration to analyze performance, I
>> think we should avoid any requirement for long running metrics. Then the
>> monitoring becomes a critical production service in its own right - and I
>> think unnecessarily in this case.
>>
>>
I have to disagree on part of this. I think it's important that we collect
long running statistics. This is going to be our tool for going to IS (or
whoever) and requesting more resources. We'll need to be able to back up our
request for 20 more BBBs with data showing that it will meet the demand. We
also need long-running statistics to establish a baseline. For example, right
now it may take 2 minutes for uci-nova to set up the testbed; if we later
notice it's taking 5 minutes, we'll be better able to find the regression.
I do agree that the statistics service should not be required to keep
services running. Just like the logging solution, data should be thrown
over the wall at it and if it's not there, nothing should break.
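In other words, metrics emission should be fire-and-forget. A statsd-style UDP
send has exactly that property (a sketch only; the host and metric names are
made up): if the collector is down, the datagram simply disappears and the
service never blocks or raises.

    # Hypothetical sketch: fire-and-forget timing metric over UDP, statsd-style.
    # If the stats host is unreachable the packet is silently lost; the
    # service carries on regardless.
    import socket

    STATSD_ADDR = ("stats.internal", 8125)   # made-up host and port

    def emit_timing(name, millis):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.sendto(("%s:%d|ms" % (name, millis)).encode("ascii"),
                        STATSD_ADDR)
        except (socket.error, OSError):
            pass                             # metrics must never break the service
        finally:
            sock.close()

    # e.g. record how long uci-nova took to set up the testbed
    emit_timing("uci_nova.testbed_setup_ms", 120000)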
> hmmmm. I don't think there's anything wrong with gathering a metric that
> we don't actively monitor. It's nice to be able to look back at historical
> data when something goes wrong - To be able to answer the question "How
> does this service's performance compare with last week's?".
>
> I imagine we'll collect 20-30 metrics from most services, but probably
> only actively monitor 2 or 3, *until something goes wrong*, at which point
> having those additional numbers can be a real boon to debugging / problem
> solving.
>
That's what I expect as well. I can see us going overboard at first and
measuring things that end up being meaningless, just like we have log
messages that we realize are irrelevant over time. And that's fine.
Measuring a meaningless thing should not have a noticeable adverse impact on
the system, so we can safely ignore such metrics forever if necessary (you've
already covered this as a criterion above).
> This is an interesting point though, I'd like to hear some more opinions.
>
>
>> Prometheus does look pretty cool at first glance, but I haven't looked at
>> it in any depth yet. I think it's worth a spike to investigate strengths
>> and weakness vs. elk to determine if one or both fit our needs better.
>> This could *certainly* be useful for future projects. For existing ones, I
>> will assume that retrofitting stats on them is a new story and should be
>> approached not from the idea of "how do we prove this is better
>> than X" but from "How do we know when there is a problem in the system, and
>> ensure that we have the right data to know what's going on so someone can
>> respond to it quickly?"
>>
>>
> Yeah, I agree:
>
> 1) Decide what tool we want to use
> 2) Retro-fit stats collection into current and past services we care about.
>
> Def. two separate stories.
>
>
> Would really like to hear more opinions here.
>
>
> Thoughts?
> --
> Thomi Richards
> thomi.richards@xxxxxxxxxxxxx
>
--
Francis Ginther
Canonical - Ubuntu Engineering - Continuous Integration Team