qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 12/17] python/machine: Handle QMP errors on close more met


From: Hanna Reitz
Subject: Re: [PATCH v2 12/17] python/machine: Handle QMP errors on close more meticulously
Date: Tue, 12 Oct 2021 17:56:59 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.1.0

On 07.10.21 18:52, John Snow wrote:


On Wed, Sep 22, 2021 at 8:50 PM John Snow <jsnow@redhat.com> wrote:

    To use the AQMP backend, Machine just needs to be a little more
    diligent
    about what happens when closing a QMP connection. The operation is no
    longer a freebie in the async world; it may return errors
    encountered in
    the async bottom half on incoming message receipt, etc.

    (AQMP's disconnect, ultimately, serves as the quiescence point
    where all
    async contexts are gathered together, and any final errors reported at
    that point.)

    Because async QMP continues to check for messages asynchronously, it's
    almost certainly likely that the loop will have exited due to EOF
    after
    issuing the last 'quit' command. That error will ultimately be bubbled
    up when attempting to close the QMP connection. The manager class here
    then is free to discard it -- if it was expected.

    Signed-off-by: John Snow <jsnow@redhat.com>

    ---

    Yes, I regret that this class has become quite a dumping ground for
    complexity around the exit path. It's in need of a refactor to help
    separate out the exception handling and cleanup mechanisms from the
    VM-related stuff, but it's not a priority to do that just yet -- but
    it's on the list.

    Signed-off-by: John Snow <jsnow@redhat.com>
    ---
     python/qemu/machine/machine.py | 48
    +++++++++++++++++++++++++++++-----
     1 file changed, 42 insertions(+), 6 deletions(-)

    diff --git a/python/qemu/machine/machine.py
    b/python/qemu/machine/machine.py
    index 0bd40bc2f76..c33a78a2d9f 100644
    --- a/python/qemu/machine/machine.py
    +++ b/python/qemu/machine/machine.py
    @@ -342,9 +342,15 @@ def _post_shutdown(self) -> None:
             # Comprehensive reset for the failed launch case:
             self._early_cleanup()

    -        if self._qmp_connection:
    -            self._qmp.close()
    -            self._qmp_connection = None
    +        try:
    +            self._close_qmp_connection()
    +        except Exception as err:  # pylint: disable=broad-except
    +            LOG.warning(
    +                "Exception closing QMP connection: %s",
    +                str(err) if str(err) else type(err).__name__
    +            )
    +        finally:
    +            assert self._qmp_connection is None

             self._close_qemu_log_file()

    @@ -420,6 +426,31 @@ def _launch(self) -> None:
                                            close_fds=False)
             self._post_launch()

    +    def _close_qmp_connection(self) -> None:
    +        """
    +        Close the underlying QMP connection, if any.
    +
    +        Dutifully report errors that occurred while closing, but
    assume
    +        that any error encountered indicates an abnormal termination
    +        process and not a failure to close.
    +        """
    +        if self._qmp_connection is None:
    +            return
    +
    +        try:
    +            self._qmp.close()
    +        except EOFError:
    +            # EOF can occur as an Exception here when using the Async
    +            # QMP backend. It indicates that the server closed the
    +            # stream. If we successfully issued 'quit' at any point,
    +            # then this was expected. If the remote went away without
    +            # our permission, it's worth reporting that as an
    abnormal
    +            # shutdown case.
    +            if not self._quit_issued:


I strongly suspect that this line needs to actually be "if not (self._user_killed or self._quit_issued):" to also handle the force-kill cases.

Sounds right.

(Other than that, the patch looks good to me.)

I wrote a different sync compatibility layer in a yet-to-be-published branch and noticed this code still allows EOFError through. Why it did not seem to come up in *this* series I still don't know -- I think the timing of when exactly the coroutines get scheduled can be finnicky depending on what else is running. So, sometimes, we manage to close the loop before we get EOF and sometimes we don't.

I am mulling over adding a "tolerate_eof: bool = False" argument to disconnect() to allow the caller to handle this stuff with a little less boilerplate. Anyone have strong feelings on that?

FWIW I don’t.

(Ultimately, because there's no canonical "goodbye" in-band message, the QMP layer itself can never really know if EOF was expected or not -- that's really up to whomever is managing the connection, but it does add a layer of complexity around the exit path that I am starting to find is a nuisance. Not to mention that there are likely many possible cases in which we might expect the remote to disappear on us that don't involve using QMP at all -- kill is one, using the guest agent to coerce the guest into shutdown is another. Networking and migration are other theoretical(?) avenues for intended disconnects.)

For the iotests we don’t have a guest, so I suppose this doesn’t concern me, does it? :)

Hanna




reply via email to

[Prev in Thread] Current Thread [Next in Thread]