canonical-ci-engineering team mailing list archive


proposal for next sprint

To: canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
From: Thomi Richards <thomi.richards@xxxxxxxxxxxxx>
Date: Mon, 4 May 2015 16:29:19 +1200

Friends,


I'd like to propose that we consider working on adding some real-time stats
monitoring for the already deployed microservices in the next sprint.

Why do we need this?
  I'd like to be able to answer some questions like:

   - What's the median / 95th percentile time for package test runs in
   adt-cloud? Is that increasing or decreasing over time?
   - What's the size of the adt-cloud rabbit request queues? Are they
   growing or shrinking?
   - When was the last time a test request failed? What's our MTBF?
   - etc.
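
To make the first and third questions concrete, here is a minimal sketch
of the arithmetic involved, using only the Python standard library. The
function names, the sample durations and the failure timestamps are all
made up for illustration; nothing here comes from adt-cloud itself, and
the percentile uses the simple nearest-rank definition rather than
whatever a real stats system would choose.

```python
import statistics
from datetime import datetime, timedelta


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample such that at least pct% of all samples
    are less than or equal to it.
    """
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of len * pct / 100, avoiding float rounding.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]


def mean_time_between_failures(failure_times):
    """Average gap between consecutive failure timestamps."""
    ordered = sorted(failure_times)
    if len(ordered) < 2:
        raise ValueError("need at least two failures")
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical test-run durations in seconds.
durations = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
median_duration = statistics.median(durations)
p95_duration = percentile(durations, 95)

# Hypothetical failure timestamps.
start = datetime(2015, 5, 1, 0, 0)
failures = [start, start + timedelta(hours=2), start + timedelta(hours=6)]
mtbf = mean_time_between_failures(failures)
```

Computing these once is easy; the point of a stats system is to keep
computing them continuously so the trend over time is visible.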

Right now we're trying to use logstash & kibana to answer these questions,
but they're really not designed for stats collection.

I've spent the day looking at ways we could approach this.

A common approach seems to be statsd + graphite, but after playing with
them on my laptop I find the graphite UI to be awful, and the entire system
seems really unfriendly and fragile. I'm new to these systems though, so
perhaps someone else here has more experience and can provide me with some
additional information.

Another approach is influxdb + grafana, but influxdb doesn't seem to be
ready for production use (despite what their website says). Also, grafana
is a fork of kibana, and has all the same UI issues that kibana does.

OTOH, I also spent some time looking at prometheus (http://prometheus.io/),
which looks really nice:

   - It's pretty self-reliant: it doesn't need a separate storage backend.
   - It's micro-service-ey, by which I mean it has total data locality, so
   you can upgrade it by standing up a new instance with new config, swap a
   floating IP over, and tear down the old one (you can also run it on your
   laptop, which is nice for testing configs before deploying them in
   production).
   - It supports both 'push' and 'pull'-based metrics. I can see good use
   cases for both scenarios in our existing infrastructure.
   - The docs are awesome: http://prometheus.io/docs/introduction/overview/
   - I can grab the source, build it, and get it graphing metrics in ~ 10
   minutes (this is my rough metric of future frustration).
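
As a rough illustration of the 'pull' model mentioned in the list above:
a service exposes its current values as plain text over HTTP, and the
monitoring server fetches them on a schedule. The sketch below only
renders gauge values in a simple "name value" text layout with a
"# TYPE" comment line; the function name and the metric name are
hypothetical, and the exact format a given prometheus version expects
should be checked against its docs before relying on this.

```python
def render_metrics(gauges):
    """Render a dict of {metric_name: value} as plain-text metric lines.

    Each metric gets a '# TYPE <name> gauge' comment followed by a
    '<name> <value>' sample line. Names are emitted in sorted order so
    the output is stable between calls.
    """
    lines = []
    for name, value in sorted(gauges.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: current depth of a request queue.
body = render_metrics({"adt_request_queue_depth": 3})
```

In the pull case a service would return this text from an HTTP handler;
in the push case a short-lived job would send the same kind of payload
to an intermediary instead of waiting to be scraped.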


So I'm curious - does anyone else see this need? What's the correct way to
propose work for the next sprint? I think this would be a nice piece of
work for someone to take on over the next few weeks. If no one else wants
to, I'll certainly volunteer myself...


Cheers,
-- 
Thomi Richards
thomi.richards@xxxxxxxxxxxxx

Follow ups

Re: proposal for next sprint
From: Francis Ginther, 2015-05-04