
canonical-ci-engineering team mailing list archive

Re: proposal for next sprint

 

On Sun, May 3, 2015 at 11:29 PM, Thomi Richards <
thomi.richards@xxxxxxxxxxxxx> wrote:

> Friends,
>
>
> I'd like to propose that we consider working on adding some real-time
> stats monitoring for the already deployed microservices in the next sprint.
>
> Why do we need this?
>   I'd like to be able to answer some questions like:
>
>    - What's the median / 95th percentile time for package test runs in
>    adt-cloud? Is that increasing or decreasing over time?
>    - What's the size of the adt-cloud rabbit request queues? Are they
>    growing or shrinking?
>    - When was the last time a test request failed? What's our MTBF?
>    - etc.
>
I think there are a number of statistical metrics we should be monitoring,
and this should really be a part of sprint planning. As with all criteria, we
need to have an idea of which metrics would be useful for the given
solution. Attempting to come up with all possible metrics up front would
lead to many that only add noise. If we had had some basic metrics in place
from the beginning (and monitored them), we would have had better insight
into the impact of the cloud-config additions and, ideally, some better
performance comparisons with the existing VM solution.
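
To make the first and third of Thomi's questions concrete, here is a rough
sketch of how those numbers could be computed once raw data is being
collected. Everything in it is an assumption for illustration: the sample
durations and failure timestamps are made up, the function names are
hypothetical, and MTBF is approximated as the mean gap between consecutive
failure timestamps rather than true uptime between failures.

```python
import math
from datetime import datetime, timedelta
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def mean_time_between_failures(failure_times):
    """Mean gap between consecutive failures (an approximation of MTBF).
    Returns None when there are fewer than two failures."""
    ordered = sorted(failure_times)
    if len(ordered) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Made-up test run durations in seconds (not real adt-cloud data).
durations = [120, 95, 310, 150, 101, 98, 640, 133, 127, 140]
median_seconds = median(durations)
p95_seconds = percentile(durations, 95)

# Made-up failure timestamps.
failures = [
    datetime(2015, 5, 1, 8, 0),
    datetime(2015, 5, 2, 8, 0),
    datetime(2015, 5, 4, 8, 0),
]
mtbf = mean_time_between_failures(failures)
```

With only ten samples the 95th percentile is simply the slowest run, which
is one reason to start collecting early: the tail numbers only become
meaningful with a reasonable amount of history.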


> Right now we're trying to use logstash & kibana to answer these questions,
> but it's really not designed for stats collection.
>
> I've spent the day looking at ways we could approach this.
>
> A common approach seems to be statsd + graphite, but after playing with
> them on my laptop I find the graphite UI to be awful, and the entire system
> seems really unfriendly and fragile. I'm new to these systems though, so
> perhaps someone else here has more experience and can provide me with some
> additional information.
>

I've also found graphite to be hard to use, but I've never taken the time
to learn how to use it properly. Does anyone on the team have good
experience with graphite?


> Another approach is influxdb + grafana, but influxdb doesn't seem to be
> ready for production use (despite what their website says), also grafana is
> a fork of kibana, and has all the same UI issues that kibana does.
>

Celso has mentioned using ELK plugins for reporting metrics; this could be
another alternative. I have not looked at this myself.


> OTOH, I also spent some time looking at prometheus (http://prometheus.io/),
> which looks really nice:
>
>    - It's pretty self reliant - doesn't need a storage backend.
>    - It's micro-service-ey, by which I mean it has total data locality,
>    so you can upgrade it by standing up a new instance with new config, swap a
>    floating IP over, and tear down the old one (you can also run it on your
>    laptop, which is nice for testing configs before deploying them in
>    production).
>    - It supports both 'push' and 'pull'-based metrics. I can see good use
>    cases for both scenarios in our existing infrastructure.
>    - The docs are awesome:
>    http://prometheus.io/docs/introduction/overview/
>    - I can grab the source, build it, and get it graphing metrics in ~ 10
>    minutes (this is my rough metric of future frustration).
>

I've only had a chance to skim the resources so far. From past experience,
push metrics worked for everything, but then again, when that's all that's
available (thinking statsd/graphite), that's all you think about :-).
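
For anyone who hasn't seen the push model up close, this is roughly what it
looks like with statsd: the service formats a small text packet and fires it
at the collector over UDP. This is only a sketch. The metric name is
hypothetical, and the host and port are assumptions about where a collector
would be listening (8125 is the usual statsd default).

```python
import socket


def statsd_timing_packet(name, millis):
    """Format a statsd timing metric as '<name>:<value>|ms'."""
    return f"{name}:{millis}|ms".encode("ascii")


def push_metric(packet, host="127.0.0.1", port=8125):
    """Send one packet to the collector over UDP, fire and forget.
    Host and port are assumed values, not our real deployment."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


# Hypothetical metric name for a package test run that took 320 ms.
packet = statsd_timing_packet("adt_cloud.test_run", 320)
```

Calling push_metric(packet) would send it. The pull model inverts this: the
service exposes its current values and the monitoring system fetches them on
its own schedule.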

> So I'm curious - does anyone else see this need? What's the correct way to
> propose work for the next sprint? I think this would be a nice piece of
> work for someone to work on for the next few weeks. If no one else wants
> to, I'll certainly volunteer myself...
>

I really like the utility we've established with logging to ELK. It's
become quite painless to add logging content with rich metadata around it.
If there is a metrics equivalent, I'm all for it.

Thanks for putting this together. Hope to be able to play with prometheus a
bit soon.

Francis


> Cheers,
> --
> Thomi Richards
> thomi.richards@xxxxxxxxxxxxx
>
> --
> Mailing list: https://launchpad.net/~canonical-ci-engineering
> Post to     : canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~canonical-ci-engineering
> More help   : https://help.launchpad.net/ListHelp
>
>


-- 
Francis Ginther
Canonical - Ubuntu Engineering - Continuous Integration Team
