canonical-ci-engineering team mailing list archive
Message #00586
PagerDuty
Hi everyone,
We’ve started a trial of PagerDuty for managing our alerting on CI
infrastructure issues. Larry has already added our on-call rotation
for the Vanguard shifts and pointed PagerDuty at our Nagios server.
You should start to get emails whenever an issue crops up during your
shift. You’ll just need to do a few things to get these notifications
sent to your phone:
1) Download the app for iOS or Android and sign in on it, if you use
one of those two platforms.
2) Go back to your desktop browser and open the PagerDuty website.
Click your email address in the top right, then “My Profile.” Add your
cell phone number under SMS.
3) Under Notification Rules you should see “Immediately after an
incident is assigned to me, email me at your.name@xxxxxxxxxxxxx
(Default)” and “Immediately after an incident is assigned to me, push
notify me on Name’s Phone.”
4) Click Add Notification Rule and set it to send you an SMS after 15
minutes.
Notifications will first go to the individual on call, who is expected
to acknowledge and handle them within 5 minutes. If the notification is
not handled after 10 minutes, Larry will be notified. If Larry doesn’t
handle it, I’ll be notified after 30 minutes. This should ensure that
nothing falls through the cracks. We’ll need to make some refinements
to cover holidays and create a better fallback policy that’s aware of
who else is around in the time zone in question, but this should be a
good start.
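As a rough illustration of the escalation chain above, here is a
minimal Python sketch. It is not PagerDuty's configuration format or
API. The names EscalationStep, POLICY and notified_so_far are made up
for illustration, "list author" is a placeholder for the sender, and
the sketch assumes the 10- and 30-minute delays are both counted from
when the incident is opened, which the policy above does not spell
out. Only those two escalations are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the incident was opened (assumption)
    notify: str         # who gets notified at this step


# Timings taken from the message; the structure itself is hypothetical.
POLICY = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(10, "Larry"),
    EscalationStep(30, "list author"),
]


def notified_so_far(minutes_unhandled, policy=POLICY):
    """Return everyone who has been notified about a still-unhandled incident."""
    return [step.notify for step in policy
            if minutes_unhandled >= step.after_minutes]
```

For example, notified_so_far(12) lists the on-call engineer and Larry,
because the 30-minute step has not been reached yet.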
There will be alertable issues not yet wired into Nagios. We need to
work through the incident reports and implement whatever check is
needed to ensure we get a notification the next time it happens.
Please add concrete actions for discovered missing checks here:
https://app.asana.com/0/9011593109151/
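For anyone who has not written a Nagios check before, the core of one
is small. The sketch below is only an illustration and not one of our
existing checks: the "stalled jobs" metric, the thresholds and the
function name evaluate are all hypothetical. What it does rely on is
the standard Nagios plugin convention of one status line plus an exit
code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(stalled_jobs, warn, crit):
    """Map a reading to (exit_code, status_line); higher readings are worse.

    `stalled_jobs` is a hypothetical metric; pass None when it cannot be read.
    """
    if stalled_jobs is None:
        return UNKNOWN, "UNKNOWN - could not read stalled job count"
    if stalled_jobs >= crit:
        return CRITICAL, f"CRITICAL - {stalled_jobs} stalled jobs (limit {crit})"
    if stalled_jobs >= warn:
        return WARNING, f"WARNING - {stalled_jobs} stalled jobs (limit {warn})"
    return OK, f"OK - {stalled_jobs} stalled jobs"
```

A real plugin would print the status line and pass the code to
sys.exit(), so that Nagios, and in turn PagerDuty, sees the result.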
This is obviously only half of the puzzle. We need to build resilience
into our infrastructure, as you’re already doing with the Airline
work, and as we’ll do by bringing in the emulator as a replacement for
some physical hardware. Do continue to make suggestions for better
resilience in your incident reports.
Please share your experiences. This is a trial and I want to know how
effective you find the combination of Nagios and PagerDuty for
learning about and resolving issues quickly.
Thanks