canonical-ci-engineering team mailing list archive
Message #00537
Re: otto containers left running, lxc-stop hanging
On Mon, Dec 16, 2013 at 11:49:44AM +0100, Vincent Ladeuil wrote:
> Hi,
>
> So, I previously set up jenkins jobs
> (http://q-jenkins.ubuntu-ci:8080/job/autopilot-trusty-daily_release/)
> using otto to stop containers left running as a catch-all to solve the
> deadlock of otto checking for a running container before attempting a
> new job. In other words, the design was such that if a job left a
> container running, no other jobs could be attempted. The workaround is
> to make such jobs check for a container as a Post Build task that is run
> even if the job times out or is aborted.
>
> This worked. For some time.
>
> A new case appeared last Friday
> (https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-12-13-qa-intel-4000-kernel-crash)
> where 'lxc-stop' would hang for (as yet) unknown reasons
> (https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1261338 filed).
>
> The only way I could find to get the host back to a working state was to
> reboot it :-/
>
> So, after trying 'lxc-stop -t <timeout>', which also hung :-/, I've
> settled on:
>
> modified jenkins/stop_running_container
>
>
> === modified file 'jenkins/stop_running_container'
> --- jenkins/stop_running_container 2013-12-16 08:56:03 +0000
> +++ jenkins/stop_running_container 2013-12-16 10:04:30 +0000
> @@ -35,13 +35,9 @@
> for c in ${RUNNING_CONTAINERS} ; do
> echo "W: Will stop '$c' left running and blocking further otto jobs"
> # Make sure we'll continue even if the container is not running anymore
> - set +e
> - # Stop the container by nuking it, codename: Little Boy
> - sudo lxc-stop -k -t 120 -n $c
> - ret=$?
> - if [ $ret -ne 0 ]; then
> - # This wasn't enoug, use a more powerful nuke, codename: Fat Man
> - (echo "Couldn't stop the container, reboot..."; sleep 20; sudo reboot)&
> - fi
> - set -e
> +
> + # Since 'lxc-stop', 'lxc-stop -k' fail in some contexts, and that
> + # 'lxc-stop -t <timeout>' can hang, just use reboot
> + echo "Couldn't stop the container, rebooting..."
> + sudo shutdown -r now
> done
>
> And cherry-picked that change in
> http://q-jenkins.ubuntu-ci:8080/job/autopilot-trusty-daily_release.
>
> Note that lp:~vila/otto/stop-running-container is not deployed on all
> otto nodes; keep that in mind when deploying further changes there.
>
> I'm seeking feedback from the team on better ways to recover from
> such failures. Some leads already identified:
>
> - use a kvm with a pass-through graphics card so the host is immune to
> related crashes (long term, requires significant changes in otto),
>
> - better track the causes of containers left running in otto itself so
> we rely on the catch-all less and less as they are fixed.
>
> I've added Stephane in CC for feedback on lxc itself. It's a bit weird
> that there is no way to forcefully stop a container, or at least to get
> an error (rather than a hang) when this happens.
Did you try "lxc-stop -n <container> -k" which is the upstream supported
way of forcefully killing a container?
In theory lxc-stop sends SIGPWR, then waits 30s and sends SIGKILL to
init. If SIGKILL doesn't work, then you have much bigger problems
(typically kernel related).
So please try with -k. If that doesn't work, please let me access one of
those hanging machines so I can confirm that it's not an LXC issue and
that something in the kernel is indeed making one of the tasks
unkillable.
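
As a rough illustration of trying -k without letting the job block
forever, here is a minimal sketch, not something taken from otto or
lxc. The function names run_with_deadline and stop_container_or_reboot
are hypothetical, and so is the 120s deadline (borrowed from the
removed 'lxc-stop -k -t 120' line in the diff). It assumes the script
already runs as root, since a non-root shell generally cannot signal a
child started through sudo. Note that it only stops waiting: if a task
really is unkillable in the kernel, the SIGKILL sent here will not
help either, and the reboot remains the last resort.

```shell
#!/bin/sh
# Hypothetical helper (not part of otto or lxc): run a command in the
# background and stop waiting for it once a deadline (in seconds) passes.
# Returns the command's exit status, or 124 if the deadline was hit.
run_with_deadline() {
    rwd_deadline=$1
    shift
    "$@" &
    rwd_pid=$!
    rwd_elapsed=0
    # 'kill -0' sends no signal, it only tests that the process still exists
    while kill -0 "$rwd_pid" 2>/dev/null; do
        if [ "$rwd_elapsed" -ge "$rwd_deadline" ]; then
            # Best effort: a task stuck in the kernel may ignore SIGKILL too
            kill -KILL "$rwd_pid" 2>/dev/null
            return 124
        fi
        sleep 1
        rwd_elapsed=$((rwd_elapsed + 1))
    done
    # The command finished on its own: report its exit status
    wait "$rwd_pid"
}

# Hypothetical wrapper, assuming the script already runs as root:
# force-stop the container, and reboot only if that fails or hangs.
stop_container_or_reboot() {
    if ! run_with_deadline 120 lxc-stop -k -n "$1"; then
        echo "E: could not stop '$1' within the deadline, rebooting..."
        shutdown -r now
    fi
}
```

Inside the existing loop this would be called as
stop_container_or_reboot "$c", so the host only reboots when
'lxc-stop -k' itself fails or does not return in time.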
>
> Vincent
--
Stéphane Graber
Ubuntu developer
http://www.canonical.com