canonical-ci-engineering team mailing list archive

Long-running nagios checks

To: canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
From: Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx>
Date: Mon, 7 Jul 2014 14:17:42 +0100
Sender: evan@xxxxxxxxxxxxxx

As mentioned in the team meeting, NRPE checks that take a long time to
complete cause complications. I spoke to James about this a moment
ago.

There's a 30 second response timeout in NRPE (Nagios Remote Plugin
Executor). The way they and other prodstack-deployed teams work around
this is by driving the test from cron. On success the cron job writes a
success message into a file on disk. On failure it writes a failure
message into the same file. NRPE then checks both that the
timestamp of this file is recent and that it contains the success
message. This covers both the cron job itself failing (the file
doesn't exist or hasn't been updated in a while) and the test itself
failing.
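
To make the NRPE side of this concrete, here is a minimal Python sketch
of such a check. It is not the code from lp:canonical-is-puppet; the
status file path, the "OK" success marker and the 15 minute freshness
window are all assumptions chosen for illustration.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical values; none of these come from the thread.
STATUS_FILE = "/var/lib/nagios/long_check.status"
SUCCESS_MARKER = "OK"
MAX_AGE_SECONDS = 15 * 60


def evaluate(path, max_age, marker=SUCCESS_MARKER, now=None):
    """Return (exit_code, message) for the status file at `path`."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path) as f:
            content = f.read().strip()
    except OSError as exc:
        # Missing or unreadable file: the cron job probably never ran.
        return CRITICAL, "CRITICAL: cannot read %s: %s" % (path, exc)
    if age > max_age:
        # Stale file: the cron job has stopped updating it.
        return CRITICAL, "CRITICAL: %s is %d seconds old" % (path, age)
    if not content.startswith(marker):
        # Fresh file, but the test itself reported a failure.
        return CRITICAL, "CRITICAL: %s" % (content or "empty status file")
    return OK, content


def main():
    code, message = evaluate(STATUS_FILE, MAX_AGE_SECONDS)
    print(message)
    return code
```

A script built on this would end with sys.exit(main()), so that NRPE
sees the exit code. The check only stats and reads one small file, so
it returns well within the NRPE timeout however long the real test
takes.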

James said the code for this is buried in the depths of
lp:canonical-is-puppet.

As one example, cron¹ calls the u1db engine status check², which calls
Nagios' check_http on the local wsgi server and dumps the results to
disk³. This is then read by the nrpe-called check⁴.
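
The cron side of this pattern could look roughly like the Python sketch
below. It is not the u1db script itself: run_and_record, the "OK" and
"FAIL" prefixes and the 600 second timeout are hypothetical names and
values picked for illustration.

```python
import os
import subprocess
import tempfile


def run_and_record(command, status_path, timeout=600):
    """Run `command` and atomically write an OK/FAIL line to `status_path`."""
    try:
        result = subprocess.run(
            command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            universal_newlines=True, timeout=timeout)
        output = result.stdout.strip()
        ok = result.returncode == 0
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The command could not be started or ran for too long.
        output, ok = str(exc), False
    lines = output.splitlines()
    summary = lines[0] if lines else "no output"
    line = "%s: %s\n" % ("OK" if ok else "FAIL", summary)
    # Write to a temporary file in the same directory, then rename, so a
    # reader never sees a partially written status file.
    directory = os.path.dirname(os.path.abspath(status_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(line)
    # mkstemp creates the file as 0600; let the nagios user read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, status_path)
    return ok
```

A cron entry would call something like
run_and_record(["/usr/lib/nagios/plugins/check_http", "-H", "localhost"],
status_path), where the plugin path and arguments are again assumptions
about the local install. Because the file is rewritten on every run,
its modification time tells the reader when cron last completed.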

¹ ./modules/ubuntuone/templates/u1db-engines-check-cron.erb
² ./modules/ubuntuone/templates/get_u1db_engines_status.sh.erb
³ /srv/<%= vhost_name %>/var/nagios/engines_status
⁴ ./modules/ubuntuone/files/check_u1db_engines_status.py