
canonical-ci-engineering team mailing list archive

Re: proposal for next sprint

 

On Wed, May 06, 2015 at 10:07:31PM -0500, Francis Ginther wrote:
> On Wed, May 6, 2015 at 8:11 PM, Thomi Richards <thomi.richards@xxxxxxxxxxxxx
> > wrote:
> 
> > Hi,
> >
> > On Thu, May 7, 2015 at 9:28 AM, Paul Larson <paul.larson@xxxxxxxxxxxxx>
> > wrote:
> >
> >>
> >> I think we knew it would be important, but we didn't yet know what needed
> >> to be measured.
> >>
> >
> > Indeed. I don't mean to suggest that I (or anyone) can name all the stats
> > we want to track for any new system we build - these things will evolve and
> > change over time. Let's try to capture that in an acceptance criterion: It
> > seems like we're saying that whatever we deploy must be able to pick up
> > new/changed metrics easily... I'm thinking along the lines of "I just
> > re-deploy the service(s) with the new stats enabled and it "just works" -
> > no need to configure multiple things". So, as a first stab, how about some
> > acceptance criteria like:
> >
> >  * System must be able to respond to newly exposed metrics without
> > requiring manual configuration?
> >
> > -or perhaps-
> >
> > * Must be able to track additional metrics, or change existing metrics
> > without needing to change anything other than the service(s) affected?
> >
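A rough sketch of what the "it just works" behaviour could look like with a
Graphite/carbon-style backend (hostname and metric path below are invented):
the service writes one line to carbon's plaintext port and the series is
created on first write, with nothing to configure on the collector side.

    import socket
    import time

    CARBON = ("graphite.internal", 2003)   # assumed carbon plaintext endpoint

    def send_point(path, value):
        # carbon's plaintext protocol is "metric.path value timestamp\n";
        # an unknown path is simply created the first time it is written
        line = "%s %s %d\n" % (path, value, int(time.time()))
        sock = socket.create_connection(CARBON, timeout=2)
        try:
            sock.sendall(line.encode("ascii"))
        finally:
            sock.close()

    # exposing a brand new metric is just one more call at the source
    send_point("uci.engine.image_builds.active", 7)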
> 
> Thanks for starting to define the criteria.
> 
> 
> > ugh - writing these criteria is hard. Suggestions more than welcome - I
> > think my suggestions above missed the mark somewhat :D
> >
> >
> >
> >> I'm not sure we still know *everything* that needs measurement, but it's
> >> worth noting that many of the measurements that would have been really
> >> useful were not so much about continuing operations, but about comparisons
> >> between the old and new system. This should have been called out as a gap
> >> in the acceptance criteria.  I do agree that some operational measurements
> >> in the future are useful too, but who monitors those? Do we alert on any of
> >> them?
> >>
> >
> > Agreed. How about:
> >
> > * The system must support alerting via PagerDuty?
> >
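For concreteness, a rough sketch of what that could look like against
PagerDuty's generic Events API, using python-requests (the integration key
and event details below are placeholders, and whether we call it directly
or go through nagios is still an open question):

    import json
    import requests

    EVENTS_URL = ("https://events.pagerduty.com/generic/"
                  "2010-04-15/create_event.json")
    SERVICE_KEY = "REPLACE_WITH_INTEGRATION_KEY"   # per-service key

    def trigger_incident(description, details=None):
        # a "trigger" event opens (or de-duplicates into) an incident on
        # the PagerDuty service behind SERVICE_KEY
        payload = {
            "service_key": SERVICE_KEY,
            "event_type": "trigger",
            "description": description,
            "details": details or {},
        }
        resp = requests.post(EVENTS_URL, data=json.dumps(payload), timeout=5)
        resp.raise_for_status()
        return resp.json()

    trigger_incident("deadletter queue flooding", {"queue": "uci.deadletter"})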
> 
> I didn't consider alerting before, but it's an interesting idea.
> 
> 
> >> The big ones that come to mind could mostly be solved by the queue stats
> >> that Celso worked on
> >>
> >
> > I'm not sure what that is?
> >
> >
> >> I don't think we'll need those measurements again unless we are called to
> >> revamp the solution later down the road and need to understand its
> >> performance characteristics.  I think the bigger hole for the moment is
> >> alerting, and having a good place to send alerts with a clear path on how
> >> to resolve them, e.g. a deadletter queue flood, a big spike in queue
> >> depth, etc.
> >>
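To make "big spike in queue depth" concrete, a rough sketch of a check
against the RabbitMQ management API (host, credentials and threshold are
placeholders; what we do on breach - nagios check, PagerDuty trigger - is
the open question above):

    import requests

    MGMT_URL = "http://rabbit.internal:15672/api/queues"   # mgmt plugin
    AUTH = ("monitor", "secret")          # placeholder credentials
    THRESHOLD = 1000                      # messages; tune per queue

    def deep_queues():
        # the management API reports a per-queue "messages" count
        # (ready + unacked); a sudden jump is the spike we care about
        queues = requests.get(MGMT_URL, auth=AUTH, timeout=5).json()
        return [(q["name"], q.get("messages", 0))
                for q in queues if q.get("messages", 0) > THRESHOLD]

    for name, depth in deep_queues():
        print("ALERT: %s has %d messages queued" % (name, depth))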
> >
> > Agreed. I think, though, that we want a system that can track these things
> > long term without any sort of cognitive overload. So, we track the
> > performance metrics now (because we're handing the system over), but we
> > also continue to track those metrics forever: we want a system that
> > doesn't degrade as we track more and more things. We NEVER want to be in a
> > situation where we're saying "OK, let's stop tracking X, Y, Z, to make room
> > for these new metrics". As acceptance criteria, how about something
> > like:
> >
> > * The system must be able to track metrics without displaying them.
> > * The system must be able to track a large number of metrics without
> > degrading performance (ugh - please help me re-write this).
> >
> >
> >>
> >>> I'd love to get some more information on ELK plugins. I don't have much
> >>> experience with elasticsearch, and the little bit I tried to do (backing up
> >>> and restoring elasticsearch when we migrated the ELK deployment to
> >>> production) proved to be tricky.
> >>>
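(The snapshot API makes backup/restore less fiddly than copying data
directories around - a rough sketch, with the repository name and path
invented, assuming a filesystem location reachable from every node:)

    import json
    import requests

    ES = "http://elasticsearch.internal:9200"   # placeholder host

    # register a filesystem snapshot repository (e.g. an NFS mount
    # shared by all nodes)
    repo = {"type": "fs", "settings": {"location": "/srv/es-backups"}}
    requests.put(ES + "/_snapshot/uci_backup", data=json.dumps(repo))

    # take a snapshot of all indices and wait for it to finish
    requests.put(ES + "/_snapshot/uci_backup/snap_1?wait_for_completion=true")

    # restoring into a fresh cluster is the reverse (the target indices
    # must be closed or absent first)
    requests.post(ES + "/_snapshot/uci_backup/snap_1/_restore")
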
> >> Unless we are collecting for a limited duration to analyze performance, I
> >> think we should avoid any requirement for long-running metrics.  Then the
> >> monitoring becomes a critical production service in its own right - and I
> >> think unnecessarily so in this case.
> >>
> >>
> I have to disagree on part of this. I think it's important that we collect
> long-running statistics. This is going to be our tool for going to IS (or
> whoever) and requesting more resources. We'll need to be able to back up a
> request for 20 more BBBs with data showing that it's going to meet the
> demand. We also need long-running statistics to establish a baseline. For
> example, right now it may take 2 minutes for uci-nova to set up the
> testbed; if we later notice that it is taking 5 minutes, we will be better
> able to find the regression.
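
That baseline is easy to feed from the service itself - a rough sketch,
with the statsd-style collector address and the metric name invented:

    import socket
    import time

    STATSD = ("statsd.internal", 8125)    # placeholder collector address

    def timed_setup(provision):
        # run the (hypothetical) testbed provisioning callable and report
        # how long it took as a statsd timer ("<name>:<millis>|ms"); the
        # resulting series is what we compare week over week to spot the
        # 2-minute -> 5-minute drift
        started = time.time()
        provision()
        elapsed_ms = int((time.time() - started) * 1000)
        datagram = "uci.nova.testbed_setup:%d|ms" % elapsed_ms
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(datagram.encode("ascii"), STATSD)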
> 
> I do agree that the statistics service should not be required to keep other
> services running. Just like the logging solution, data should be thrown over
> the wall at it, and if it isn't there, nothing should break.
> 
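For what it's worth, that "throw it over the wall" property falls out
naturally from a statsd-style UDP send - a rough sketch (the collector
address is a placeholder):

    import socket

    STATSD = ("statsd.internal", 8125)    # placeholder collector address

    def bump(counter, value=1):
        # UDP is connectionless, so a missing or dead collector just means
        # the datagram is dropped; the sending service never blocks, and
        # any local error is swallowed so metrics can't take it down
        try:
            datagram = ("%s:%d|c" % (counter, value)).encode("ascii")
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            sock.sendto(datagram, STATSD)
        except socket.error:
            pass

    bump("uci.engine.tickets_processed")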
> 
> > hmmmm. I don't think there's anything wrong with gathering a metric that
> > we don't actively monitor. It's nice to be able to look back at historical
> > data when something goes wrong - to be able to answer the question "How
> > does this service's performance compare with last week's?".
> >
> > I imagine we'll collect 20-30 metrics from most services, but probably
> > only actively monitor 2 or 3, *until something goes wrong*, at which point
> > having those additional numbers can be a real boon to debugging / problem
> > solving.
> >
> 
> That's what I expect as well. I can see us going overboard at first and
> measuring things that end up being meaningless, just like we have log
> messages that we realize are irrelevant over time. And that's fine.
> Measuring a meaningless thing should not have a noticeable adverse impact
> on the system, so we can safely ignore such metrics forever if necessary
> (you've already covered this as a criterion above).

One thing to consider is the cost of implementing measurements that are
meaningless.  I think we are trying to squeeze in an interesting and
potentially useful bit of engineering work that we don't *know* we need
yet.  I think Thomi is spot on regarding the quota usage monitoring.  But
I think we should *just* implement that bit and add further metrics as we
identify their utility.  Remember: start small and build out from there.
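
For the quota usage piece specifically, the sort of check I have in mind is
roughly the following, using python-novaclient (credentials, endpoints and
the limits we report are all placeholders to be filled in):

    from novaclient import client

    # placeholder credentials - in practice these come from the service's
    # existing nova configuration
    nova = client.Client("2", "uci-bot", "secret", "uci-project",
                         auth_url="http://keystone.internal:5000/v2.0")

    # limits.get() reports both the quota ceilings and current usage,
    # which is the "do we need more instances?" number
    absolute = dict((l.name, l.value) for l in nova.limits.get().absolute)
    used = absolute.get("totalInstancesUsed", 0)
    allowed = absolute.get("maxTotalInstances", 0)
    print("instances: %d used of %d allowed" % (used, allowed))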

Thanks,
Joe

