canonical-ci-engineering team mailing list archive
Message #01090
Re: proposal for next sprint
Hi,
On Thu, May 7, 2015 at 9:28 AM, Paul Larson <paul.larson@xxxxxxxxxxxxx>
wrote:
>
> I think we knew it would be important, but we didn't yet know what needed
> to be measured.
>
Indeed. I don't mean to suggest that I (or anyone) can name all the stats
we want to track for any new system we build - these things will evolve and
change over time. Let's try to capture that in an acceptance criterion: it
seems like we're saying that whatever we deploy must be able to pick up
new/changed metrics easily. I'm thinking along the lines of "I just
re-deploy the service(s) with the new stats enabled and it 'just works' -
no need to configure multiple things" (there's a rough sketch of what I
mean below). So, as a first stab, how about some acceptance criteria like:
* System must be able to respond to newly exposed metrics without
requiring manual configuration?
-or perhaps-
* Must be able to track additional metrics, or change existing metrics,
without needing to change anything other than the service(s) affected?
ugh - writing these criteria is hard. Suggestions more than welcome - I
think my suggestions above missed the mark somewhat :D
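To make the "just works" bit concrete, here's roughly what I'm picturing if
we ended up with something like Prometheus and its Python client - purely
illustrative, and the metric names below are made up:

    # Illustrative sketch only - assumes prometheus_client; names are invented.
    from prometheus_client import Counter, start_http_server

    # Adding a new stat should be a one-line change in the service itself...
    jobs_processed = Counter('jobs_processed_total', 'Total jobs processed')

    def handle_job(job):
        # ... do the actual work here ...
        jobs_processed.inc()  # ...and the monitoring side picks it up on the
                              # next scrape, with no extra configuration.

    start_http_server(8000)   # expose /metrics for the collector to scrape

i.e. re-deploying the service is the *only* change - nothing to reconfigure
on the monitoring side.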
> I'm not sure we still know *everything* that needs measurement, but it's
> worth noting that many of the measurements that would have been really
> useful were not so much about continuing operations, but about comparisons
> between the old and new system. This should have been called out as a gap
> in the acceptance criteria. I do agree that some operational measurements
> in the future are useful too, but who monitors those? Do we alert on any of
> them?
>
Agreed. How about:
* The system must support alerting via PagerDuty?
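(For context on what that could mean in practice: PagerDuty exposes an
events endpoint that whatever alerting layer we pick would presumably end up
calling - something roughly like the sketch below, where the key, source and
summary are all placeholders and the details are from memory:)

    # Rough sketch only - PagerDuty's Events API as I understand it; the
    # routing key, source and summary are placeholders, not real values.
    import requests

    def trigger_page(summary):
        requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": "OUR-INTEGRATION-KEY",  # placeholder
                "event_action": "trigger",
                "payload": {
                    "summary": summary,        # e.g. "dead-letter queue flooding"
                    "source": "ci-monitoring", # hypothetical source name
                    "severity": "critical",
                },
            },
        )

In practice the tool we choose would do this for us; the criterion is just
that it *can*.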
> The big ones that come to mind could mostly be solved by the queue stats
> that Celso worked on
>
I'm not sure what that is?
> I think unless we are called to revamp the solution later down the road
> and need to understand the performance characteristics again. I think the
> bigger hole for the moment is alerting, and having a good place to send
> those with a clear path on how to resolve them. ex. deadletter queue flood,
> big spike in queue depth, etc.
>
Agreed. I think, though, that we want a system that can track these things
long term without any sort of cognitive overload. So, we track the
performance metrics now (because we're handing the system over), but we
also continue to track those metrics forever: we want a system that
doesn't degrade as we track more and more things. We NEVER want to be in a
situation where we're saying "OK, let's stop tracking X, Y, Z to make room
for these new metrics". For acceptance criteria, how about something
like:
* The system must be able to track metrics without displaying them.
* The system must be able to track a large number of metrics without
degrading performance (ugh - please help me re-write this).
>
>> I'd love to get some more information on ELK plugins. I don't have much
>> experience with elasticsearch, and the little bit I tried to do (backing up
>> and restoring elasticsearch when we migrated the elk deployment to
>> production) proved to be tricky.
>>
> Unless we are collecting for a limited duration to analyze performance, I
> think we should avoid any requirement for long-running metrics. Then the
> monitoring becomes a critical production service in its own right - and I
> think unnecessarily in this case.
>
>
Hmmmm. I don't think there's anything wrong with gathering a metric that we
don't actively monitor. It's nice to be able to look back at historical
data when something goes wrong - to be able to answer the question "How
does this service's performance compare with last week's?"
I imagine we'll collect 20-30 metrics from most services, but probably only
actively monitor 2 or 3, *until something goes wrong*, at which point
having those additional numbers can be a real boon to debugging and problem
solving.
This is an interesting point, though - I'd like to hear some more opinions.
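To be a bit more concrete about the "look back at historical data" part,
the sort of after-the-fact query I'd want to be able to run is something
like the sketch below - hand-wavy, assuming a Prometheus-style query API,
with the endpoint and metric names invented; substitute whatever tool we
actually pick:

    # Hand-wavy sketch: pull the same metric for now and for a week ago,
    # assuming a Prometheus-style HTTP query API. Endpoint/metric are invented.
    import requests

    PROM = "http://monitoring.internal:9090"  # hypothetical monitoring host

    def compare_with_last_week(metric="jobs_processed_total"):
        now = requests.get(PROM + "/api/v1/query",
                           params={"query": "rate(%s[5m])" % metric}).json()
        last_week = requests.get(PROM + "/api/v1/query",
                                 params={"query": "rate(%s[5m] offset 1w)" % metric}).json()
        return now, last_week

Nobody would be watching that metric day to day, but when something breaks
it's exactly the comparison we'd want to hand to whoever is debugging.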
> Prometheus does look pretty cool at first glance, but I haven't looked at
> it in any depth yet. I think it's worth a spike to investigate strengths
> and weakness vs. elk to determine if one or both fit our needs better.
> This could *certainly* be useful for future projects. For existing ones, I
> will assume that retrofitting stats on them is a new story and should be
> approached not from the idea of "how do we prove this is better
> than X" but from "How do we know when there is a problem in the system, and
> ensure that we have the right data to know what's going on so someone can
> respond to it quickly?"
>
>
Yeah, I agree:
1) Decide what tool we want to use.
2) Retrofit stats collection into current and past services we care about.
Definitely two separate stories.
Would really like to hear more opinions here.
Thoughts?
--
Thomi Richards
thomi.richards@xxxxxxxxxxxxx