canonical-ci-engineering team mailing list archive

Re: otto containers left running, lxc-stop hanging

 

>>>>> Stéphane Graber <stephane.graber@xxxxxxxxxxxxx> writes:

<snip/>

    > Did you try "lxc-stop -n <container> -k" which is the upstream supported
    > way of forcefully killing a container?

Yes. As mentioned, I even tried lxc-stop -k -t <timeout>

    > In theory lxc-stop sends SIGPWR, then waits 30s and sends SIGKILL to
    > init.

Ah, good. So maybe I didn't wait long enough on my last test, but I'm
pretty sure I did.

Now, while debugging this I did try to kill the init process, as at
least compiz and X were listed as defunct.

    > If SIGKILL doesn't work, then you have much bigger problems
    > (typically kernel related).

That could very well be the case.

But then, I would expect lxc-stop to fail with some error code and
respect the -t timeout. In that case, and only in that case, I can fall
back to a reboot.
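
In the meantime, the behaviour I'm after can be approximated from the
caller's side. This is only a minimal sketch, assuming Python 3 and an
outer deadline enforced by the caller; the function name
stop_with_deadline, the container name and the deadline value are
placeholders for illustration, not what our scripts currently do:

```python
import subprocess


def stop_with_deadline(cmd, deadline):
    """Return True only if cmd exits with status 0 within deadline seconds."""
    try:
        proc = subprocess.Popen(cmd)
    except OSError:
        # The command could not be started at all (e.g. not installed).
        return False
    try:
        return proc.wait(timeout=deadline) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: we do not wait for the child after kill(),
        # because a task stuck in the kernel may never go away.
        proc.kill()
        return False


# Hypothetical usage ("mycontainer" and 60 are placeholders):
#   if not stop_with_deadline(["lxc-stop", "-n", "mycontainer", "-k"], 60):
#       ...fall back to the reboot hack...
```

The point is that the caller gets a clear True/False within a bounded
time, so the reboot is only triggered when the stop really failed or
hung.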

    > So please try with -k, 

I did.

    > if that doesn't work,

It didn't.

    > please let me access one of those hanging machines so I can
    > confirm that it's not an LXC issue and that something in the
    > kernel is indeed making one of the tasks unkillable.

With pleasure, but that will have to wait :-/

I had to put the reboot hack in place to restore service, so we'll need
to plan an interruption to give you access (I don't think we can
reproduce that on a different host).

And I'll be sprinting this week and on vacation for the next 2 weeks.

But rest assured I'll get back to you ;)

So thanks a lot for the quick feedback (on the bug too!).

I'm pretty sure you're right about the deeper kernel issue: it matches
my tests last Friday. I couldn't kill the init process and I had issues
killing the other ones, so... I had to reboot in the end.
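
For the next time it happens, here is a small sketch for telling
"defunct" tasks apart from tasks that are really stuck. It assumes the
usual Linux /proc/<pid>/stat layout ("pid (comm) state ..."); the helper
names proc_state and describe are made up for this example:

```python
def proc_state(stat_line):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split after the last ')' before reading the state.
    return stat_line.rsplit(")", 1)[1].split()[0]


def describe(state):
    """Give a rough meaning for the two states that matter here."""
    if state == "D":
        return "uninterruptible sleep: typically ignores SIGKILL until it wakes"
    if state == "Z":
        return "zombie (defunct): already dead, waiting to be reaped"
    return "other"


# Hypothetical usage for a given pid:
#   with open("/proc/%d/stat" % pid) as f:
#       print(describe(proc_state(f.read())))
```

A task sitting in "D" for a long time would point at the kernel rather
than at LXC, whereas "Z" entries only tell us that the parent has not
reaped them yet.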

And stay tuned, I'll ping you as soon as I can set up a reproducing
environment ;)

    Vincent

