canonical-ci-engineering team mailing list archive

Re: otto containers left running, lxc-stop hanging

 

On Mon, Dec 16, 2013 at 11:49:44AM +0100, Vincent Ladeuil wrote:
> Hi,
> 
> So, I previously set up Jenkins jobs
> (http://q-jenkins.ubuntu-ci:8080/job/autopilot-trusty-daily_release/)
> using otto to stop containers left running as a catch-all to solve the
> deadlock of otto checking for a running container before attempting a
> new job. In other words, the design was such that if a job left a
> container running, no other jobs could be attempted. The workaround is
> to make such jobs check for a container as a Post Build task that is run
> even if the job times out or is aborted.
> 
> This worked. For some time.
> 
> A new case appeared last Friday
> (https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-12-13-qa-intel-4000-kernel-crash)
> where 'lxc-stop' would hang for (as yet) unknown reasons
> (https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1261338 filed).
> 
> The only way I could find to get the host back to a working state was to
> reboot it :-/
> 
> So, after trying 'lxc-stop -t <timeout>', which also hung :-/, I've
> settled on:
> 
>    modified      jenkins/stop_running_container
> 
> === modified file 'jenkins/stop_running_container'
> --- jenkins/stop_running_container	2013-12-16 08:56:03 +0000
> +++ jenkins/stop_running_container	2013-12-16 10:04:30 +0000
> @@ -35,13 +35,9 @@
>  for c in ${RUNNING_CONTAINERS} ; do
>      echo "W: Will stop '$c' left running and blocking further otto jobs"
>      # Make sure we'll continue even if the container is not running anymore
> -    set +e
> -    # Stop the container by nuking it, codename: Little Boy
> -    sudo lxc-stop -k -t 120 -n $c
> -    ret=$?
> -    if [ $ret -ne 0 ]; then
> -        # This wasn't enoug, use a more powerful nuke, codename: Fat Man
> -        (echo "Couldn't stop the container, reboot..."; sleep 20; sudo reboot)&
> -    fi
> -    set -e
> +
> +    # Since 'lxc-stop', 'lxc-stop -k' fail in some contexts, and that
> +    # 'lxc-stop -t <timeout>' can hang, just use reboot
> +    echo "Couldn't stop the container, rebooting..."
> +    sudo shutdown -r now
>      done
> 
> And cherry-picked that change in
> http://q-jenkins.ubuntu-ci:8080/job/autopilot-trusty-daily_release.
> 
> Note that lp:~vila/otto/stop-running-container is not deployed on all
> otto nodes; keep that in mind when deploying further changes there.
> 
> I'm seeking feedback from the team on better ways to recover from
> such failures. Some leads already identified:
> 
> - use kvm with a pass-through graphics card so the host is immune to
>   related crashes (long term, requires significant changes in otto),
> 
> - better track the causes of containers left running in otto itself so
>   we rely on the catch-all less and less as they are fixed.
> 
> I've added Stephane in CC for feedback on lxc itself; it's a bit weird
> that there is no way to forcefully stop a container, or at least get an
> error (rather than a hang) when this happens.

Did you try "lxc-stop -n <container> -k" which is the upstream supported
way of forcefully killing a container?

In theory lxc-stop sends SIGPWR, then waits 30s and sends SIGKILL to
init. If SIGKILL doesn't work, then you have much bigger problems
(typically kernel related).

So please try with -k. If that doesn't work, please let me access one of
those hanging machines so I can confirm that it's not an LXC issue and
that something in the kernel is indeed making one of the tasks
unkillable.
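
A minimal sketch of how the catch-all script could turn a hanging
lxc-stop into an error instead of blocking, assuming GNU coreutils
timeout(1) is available on the host. The function name stop_container
and the LXC_STOP override are hypothetical, used here only for
illustration; they are not part of otto or LXC:

```shell
#!/bin/sh
# Sketch only: bound a forced container stop with a hard deadline.
# LXC_STOP and stop_container are hypothetical names for illustration.
LXC_STOP="${LXC_STOP:-sudo lxc-stop}"

stop_container() {
    name="$1"
    deadline="${2:-120}"   # seconds before timeout(1) sends SIGTERM
    ret=0
    # -k 10: send SIGKILL if the command is still alive 10s after SIGTERM.
    # $LXC_STOP is left unquoted on purpose so "sudo lxc-stop" splits
    # into two words.
    timeout -k 10 "$deadline" $LXC_STOP -n "$name" -k || ret=$?
    case "$ret" in
        0)
            echo "I: '$name' stopped" ;;
        124|137)
            # 124: timeout(1) hit the deadline
            # 137: the command died from SIGKILL (e.g. the follow-up kill)
            echo "E: lxc-stop did not return for '$name' within ${deadline}s"
            return 2 ;;
        *)
            echo "E: lxc-stop failed for '$name' (exit status $ret)"
            return 1 ;;
    esac
}
```

The loop in jenkins/stop_running_container could then call
stop_container "$c" 120 and fall back to the reboot only on a non-zero
return. This only reports the hang; it does not stop a container whose
init is unkillable, and if lxc-stop itself is stuck in the kernel the
wrapper may not return either.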

> 
>       Vincent

-- 
Stéphane Graber
Ubuntu developer
http://www.canonical.com
