
Re: proposal for next sprint

 

Hi all,


Evan and I just had a hangout, and were able to add some transparency to
this discussion. What follows is a brain dump:

There are three critical areas we need to pay some attention to:


   - *Metrics*. We need a metrics system so that, when issues arise, we can
   quickly and easily track whatever it is we're interested in. Ideally we'd
   be able to re-deploy a service with a single new line of code to start
   tracking a new thing. I guess this will mostly be for performance-related
   measurements, but could also be useful in other debugging efforts.
   - *Failure Processing*. We need a way to process everything that ends
   up on the dead letter queue. Currently it's a black hole: messages go
   in, but we never look at them. We need some way of inspecting the queue
   contents, (optionally) making some changes to the system, and then
   putting those messages back onto the input queue so they get retried (a
   rough sketch of what re-injection could look like follows this list).
   - *Jenkins needs to die*. The reason we can't do 'failure processing'
   (above) right now is that Jenkins has a job timeout, so in order to make
   things fail quickly, we post results even in the case of dead-lettered
   messages. We need to:
      - Eliminate Jenkins timeouts
      - Make Jenkins wait for the results to appear
      - Manually inspect the dead letter queue, fix issues, and re-inject
      dead-lettered messages into the system.
      - Eventually, every job should end up with proper results (i.e.
      without infrastructure failures).
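
For the failure-processing point, here is a very rough sketch of what
manual re-injection could look like. I'm assuming RabbitMQ (via the
Python pika library) and making up the queue names, so treat it as an
illustration rather than a description of our actual deployment:

    import pika

    # Hypothetical queue names; substitute whatever we actually use.
    DEAD_LETTER_QUEUE = 'service.dead_letter'
    INPUT_QUEUE = 'service.input'

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()

    while True:
        # Pull one message off the dead letter queue. No auto-ack, so
        # nothing is lost if we crash part-way through.
        method, properties, body = channel.basic_get(DEAD_LETTER_QUEUE)
        if method is None:
            break  # queue drained
        print('dead-lettered message:', body)  # inspect it here
        # ...fix whatever broke, then push it back onto the input queue...
        channel.basic_publish(exchange='', routing_key=INPUT_QUEUE,
                              body=body, properties=properties)
        channel.basic_ack(method.delivery_tag)

    connection.close()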


I think these three issues deserve to be discussed separately. The mistake
I made a few emails ago was to conflate metrics monitoring with failure
avoidance. Let's keep the remaining discussion on this topic focussed on
monitoring metrics (I'd love to have the other discussions too, but let's
have those in separate threads).

With that in mind, we still need acceptance criteria for this experiment.
Personally, I'd still like to propose that we push forward with Prometheus,
and at the end of the sprint we can evaluate how well it met the
acceptance criteria.
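
To give a feel for the "single new line of code" goal from the metrics
point above, here is a minimal sketch using the Python prometheus_client
library (the metric and function names are made up for illustration):

    from prometheus_client import Counter, Histogram, start_http_server

    # Expose a /metrics endpoint for the Prometheus server to scrape.
    start_http_server(8000)

    # Adding a new measurement really is one line per metric:
    RESULTS_PROCESSED = Counter('results_processed_total',
                                'Number of test results processed')
    PROCESSING_TIME = Histogram('result_processing_seconds',
                                'Time spent processing a single result')

    def process_result(result):
        with PROCESSING_TIME.time():
            # ...existing processing code...
            RESULTS_PROCESSED.inc()

Since Prometheus scrapes that endpoint itself, adding a new measurement
shouldn't need any change on the collection side, just a redeploy of the
instrumented service.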

As far as specific criteria go, I think there are several areas we want to
measure:

- ease of adding, removing, or changing metrics in the system.
- ease of making sense of the data & visualising metrics.
- deployment concerns - data locality, IS production checklist, etc. (just
like every other service we deploy).

But maybe I missed some?

Does anyone want to take a crack at turning these into actual acceptance
criteria?


Cheers,


-- 
Thomi Richards
thomi.richards@xxxxxxxxxxxxx
