canonical-ci-engineering team mailing list archive
Message #00534
otto containers left running, lxc-stop hanging
Hi,
So, I previously set up the jenkins jobs that use otto
(http://q-jenkins.ubuntu-ci:8080/job/autopilot-trusty-daily_release/)
to stop any container left running. This is a catch-all for a deadlock:
otto checks for a running container before attempting a new job, so if
a job leaves a container running, no other job can be attempted. The
workaround is to have such jobs check for a leftover container in a
Post Build task that runs even if the job times out or is aborted.
This worked. For some time.
A new case appeared last Friday
(https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-12-13-qa-intel-4000-kernel-crash)
where 'lxc-stop' hung for (as yet) unknown reasons
(https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1261338 filed).
The only way I could find to get the host back to a working state was to
reboot it :-/
So, after trying 'lxc-stop -t <timeout>', which also hung :-/, I've
settled on:
modified jenkins/stop_running_container
=== modified file 'jenkins/stop_running_container'
--- jenkins/stop_running_container 2013-12-16 08:56:03 +0000
+++ jenkins/stop_running_container 2013-12-16 10:04:30 +0000
@@ -35,13 +35,9 @@
for c in ${RUNNING_CONTAINERS} ; do
echo "W: Will stop '$c' left running and blocking further otto jobs"
# Make sure we'll continue even if the container is not running anymore
- set +e
- # Stop the container by nuking it, codename: Little Boy
- sudo lxc-stop -k -t 120 -n $c
- ret=$?
- if [ $ret -ne 0 ]; then
- # This wasn't enoug, use a more powerful nuke, codename: Fat Man
- (echo "Couldn't stop the container, reboot..."; sleep 20; sudo reboot)&
- fi
- set -e
+
+ # Since 'lxc-stop', 'lxc-stop -k' fail in some contexts, and that
+ # 'lxc-stop -t <timeout>' can hang, just use reboot
+ echo "Couldn't stop the container, rebooting..."
+ sudo shutdown -r now
done
I also cherry-picked that change into
http://q-jenkins.ubuntu-ci:8080/job/autopilot-trusty-daily_release.
Note that lp:~vila/otto/stop-running-container is not deployed on all
otto nodes, so keep that in mind when deploying further changes there.
I'm seeking feedback from the team on better ways to recover from such
failures. Some leads already identified:
- use a kvm with a passed-through graphics card so the host is immune
to related crashes (long term, requires significant changes in otto),
- better track the causes of containers left running in otto itself so
we rely on the catch-all less and less as they are fixed.
I've added Stephane in CC for feedback on lxc itself. It's a bit weird
that there is no way to forcefully stop a container, or at least to get
an error (rather than a hang) when this happens.
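To illustrate what "get an error rather than a hang" could look like
from the calling script's side, here is a minimal sketch of a deadline
wrapper. The name run_with_deadline and the 124 return value are
assumptions made for this example; nothing like it exists in otto. It
polls the command instead of waiting on it, so the caller regains
control even if the command never exits.

```shell
#!/bin/sh
# Hypothetical helper (not part of otto): run a command, but give up
# after a deadline instead of blocking forever.
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
# Returns the command's exit status, or 124 if the deadline expired.
run_with_deadline() {
    deadline=$1
    shift
    # Start the command in the background so we can poll it.
    "$@" &
    pid=$!
    elapsed=0
    # 'kill -0' only probes whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$deadline" ]; then
            # Best effort: SIGKILL has no effect on a process stuck in
            # uninterruptible sleep, so do not wait for it afterwards.
            kill -KILL "$pid" 2>/dev/null
            return 124
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # The command finished in time: report its own exit status.
    wait "$pid"
}
```

In a script already running as root, the loop body could then read
'run_with_deadline 120 lxc-stop -k -n "$c" || shutdown -r now', so the
reboot only happens when the stop fails or times out. If the command is
started through sudo from an unprivileged script, the 'kill -0' probe
may be refused, so this sketch assumes the caller is allowed to signal
the process it starts.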
Vincent