
canonical-ci-engineering team mailing list archive

No image #60 - what went wrong?

 

Image #60 was not successfully produced. This obviously created
problems for the landing team. Let's take a few minutes to brainstorm
on what went wrong and come up with some better policy to prevent this
from happening again.

- Why wasn't an email sent to the teams affected by the phones' move to
a new host at least a day or two in advance? We have done this before
with scheduled DC maintenance and with the 1SS move. We should be
doing it every time we make a change that affects other teams.

- I realise they were unrelated events, but was there any conceivable
way we could've caught the device failure that followed? That is,
could we have kicked off some test runs, or aligned the move to the
image production time? The answer here may be no, but I want to at
least discuss why.

- Siva mentioned that the expected device wasn't appearing in `adb
devices`. Can we have a nagios check for this so we know sooner?

- Is there anything else you think we could've done to better manage
this? Short of moving to the Airline, are there things you think we
could be doing to make us more resilient to this kind of failure?
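
To make the `adb devices` check suggested above more concrete, here is a
rough sketch of what such a Nagios plugin might look like. It is only an
illustration, not an existing check: the function names are made up, the
expected serials are assumed to be passed as command-line arguments, and
it assumes `adb` is on the PATH of the host the phones are attached to.
The exit codes follow the usual Nagios plugin convention (0 OK, 2
CRITICAL, 3 UNKNOWN).

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def parse_adb_devices(output):
    """Map serial -> state from the text printed by `adb devices`."""
    devices = {}
    for line in output.splitlines():
        parts = line.split()
        # Device lines look like "<serial>\t<state>". This skips the
        # header, blank lines, and daemon start-up messages ("* ...").
        if len(parts) == 2 and not line.startswith("*"):
            devices[parts[0]] = parts[1]
    return devices


def check_devices(expected_serials, output):
    """Return (exit_code, message) for the expected serials (hypothetical helper)."""
    devices = parse_adb_devices(output)
    problems = []
    for serial in expected_serials:
        state = devices.get(serial, "missing")
        # Anything other than "device" (e.g. "offline") is treated as a failure.
        if state != "device":
            problems.append(f"{serial} is {state}")
    if problems:
        return CRITICAL, "CRITICAL: " + "; ".join(problems)
    return OK, f"OK: {len(expected_serials)} device(s) attached"


if __name__ == "__main__":
    expected = sys.argv[1:]  # assumption: serials are given as arguments
    if not expected:
        print("UNKNOWN: no expected serials given")
        sys.exit(UNKNOWN)
    try:
        result = subprocess.run(
            ["adb", "devices"],
            capture_output=True, text=True, timeout=30, check=True,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"UNKNOWN: could not run adb: {exc}")
        sys.exit(UNKNOWN)
    code, message = check_devices(expected, result.stdout)
    print(message)
    sys.exit(code)
```

Keeping the parsing separate from the `adb` call means the logic can be
tested against captured output without a phone attached.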

Thanks!

Context:

8:42 AM <didrocks> cihelp: is it me or the ci dashboard has some
issues? (can be the backend)
8:42 AM <didrocks> no image 60 results
8:42 AM <didrocks> image 61 run all for mako, but stopped on maguro
8:42 AM <didrocks> image 62 should start soon I guess
8:55 AM <psivaa> didrocks: the touch devices were being moved to a new
host last night..
8:55 AM <didrocks> psivaa: hum, did I miss an email?
8:56 AM <psivaa> didrocks: no,
8:56 AM <psivaa> <plars> psivaa: so if didrocks is wondering in the
morning what happened to image 60, it was a victim of moving those
devices to a new host :(
8:56 AM <didrocks> would be better to get an email for it :/
8:56 AM <didrocks> ev: can we establish some procedure for this? ^
8:56 AM <didrocks> psivaa: so, the new image is running tests, now?
8:57 AM <psivaa> didrocks: but according to plars the image 61 should
be going along well..
8:57 AM <psivaa> let me check please
8:57 AM <didrocks> psivaa: 61 doesn't have maguro tests
8:57 AM <didrocks> well, didn't finish them
8:58 AM <psivaa> didrocks: yea the device disappeared during camera
app tests. let me see if i can find it in the host
9:05 AM <ev> didrocks: we're supposed to already be doing that. Larry
sends out an email with each hardware move, but I guess the phones
have been considered something of a grey area. I'll make sure the team
knows that we need to be sending out warnings with any kind of change
that would affect running services, including any hardware changes.
9:06 AM <didrocks> ev: yeah, and if it's a planned change as well,
some days in advance can help :)
9:06 AM <didrocks> especially as I was really interested in the
results from run #60, and we'll never have it :/
9:09 AM <psivaa> didrocks: the particular maguro is not showing up on
the host either.. something must have happened. we need someone to
take a look in person. I'll run the tests with another device for now
9:09 AM <didrocks> psivaa: ok, thanks ;)

