[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] monitor/qmp: resume monitor when clearing its queue
From: |
Wolfgang Bumiller |
Subject: |
Re: [PATCH] monitor/qmp: resume monitor when clearing its queue |
Date: |
Wed, 9 Oct 2019 12:10:32 +0200 |
User-agent: |
NeoMutt/20180716 |
On Wed, Oct 09, 2019 at 10:39:44AM +0200, Markus Armbruster wrote:
> Cc: Marc-André for additional monitor and chardev expertise.
>
> Wolfgang Bumiller <address@hidden> writes:
>
> > When a monitor's queue is filled up in handle_qmp_command()
> > it gets suspended. It's the dispatcher bh's job currently to
> > resume the monitor, which it does after processing an event
> > from the queue. However, it is possible for a
> > CHR_EVENT_CLOSED event to be processed before before the bh
> > is scheduled, which will clear the queue without resuming
> > the monitor, thereby preventing the dispatcher from reaching
> > the resume() call.
>
> Because with the request queue cleared, there's nothing for
> monitor_qmp_requests_pop_any_with_lock() to pop, so
> monitor_qmp_bh_dispatcher() won't look at this monitor. It stays
> suspended forever. Correct?
>
> Observable effect for the monitor's user?
Yes. More easily triggered now with oob. We ran into this a longer time
ago, but our only reliable trigger was a customized version of
-loadstate which loads the state from a separate file instead of the
vmstate region of a qcow2. Turns out that doing this on a slow storage
(~12s to load the data) caused our status daemon to try to poll the qmp
socket during the load-state and give up after a 3s timeout. And since
the BH runs in the main loop which is not even entered until after the
loadstate has finished, but iothread handling the qmp socket does fill &
clear the queue, the qmp socket always ended up unusable afterwards.
Aside from that we have users reporting the same symptom (hanging qmp)
appearing randomly on busy systems.
> > Fix this by resuming the monitor when clearing a queue which
> > was filled up.
> >
> > Signed-off-by: Wolfgang Bumiller <address@hidden>
> > ---
> > @Michael, we ran into this with qemu 4.0, so if the logic in this patch
> > is correct it may make sense to include it in the 4.0.1 roundup.
> > A backport is at [1] as 4.0 was before the monitor/ dir split.
> >
> > [1]
> > https://gitlab.com/wbumiller/qemu/commit/9d8bbb5294ed084f282174b0c91e1a614e0a0714
> >
> > monitor/qmp.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/monitor/qmp.c b/monitor/qmp.c
> > index 9d9e5d8b27..c1db5bf940 100644
> > --- a/monitor/qmp.c
> > +++ b/monitor/qmp.c
> > @@ -70,9 +70,19 @@ static void qmp_request_free(QMPRequest *req)
> > /* Caller must hold mon->qmp.qmp_queue_lock */
> > static void monitor_qmp_cleanup_req_queue_locked(MonitorQMP *mon)
> > {
> > + bool need_resume = (!qmp_oob_enabled(mon) && mon->qmp_requests->length
> > > 0)
> > + || mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX;
>
> Can you explain why this condition is correct?
Sorry, I meant to add a comment pointing to monitor_qmp_bh_dispatcher(),
which does the following *after* popping 1 element off the queue:
need_resume = !qmp_oob_enabled(mon) ||
mon->qmp_requests->length == QMP_REQ_QUEUE_LEN_MAX - 1;
qemu_mutex_unlock(&mon->qmp_queue_lock);
It's supposed to be the same condition, but _before_ popping off an
element (hence no `- 1`), but the queue shouldn't be empty as well
otherwise the `monitor_suspend()` in `handle_qmp_command()` hasn't
happened, though on second though we could probably just return early in
that case.).
>
> > while (!g_queue_is_empty(mon->qmp_requests)) {
> > qmp_request_free(g_queue_pop_head(mon->qmp_requests));
> > }
> > + if (need_resume) {
> > + /*
> > + * Pairs with the monitor_suspend() in handle_qmp_command() in
> > case the
> > + * queue gets cleared from a CH_EVENT_CLOSED event before the
> > dispatch
> > + * bh got scheduled.
> > + */
> > + monitor_resume(&mon->common);
> > + }
> > }
> >
> > static void monitor_qmp_cleanup_queues(MonitorQMP *mon)
>
> Is monitor_qmp_cleanup_req_queue_locked() the correct place?
>
> It's called from
>
> * monitor_qmp_event() case CHR_EVENT_CLOSED via
> monitor_qmp_cleanup_queues(), as part of destroying the monitor's
> session state.
>
> This is the case you're trying to fix. Correct?
>
> I figure monitor_resume() is safe because we haven't really destroyed
> anything, yet, we merely flushed the request queue. Correct?
>
> * monitor_data_destroy() via monitor_data_destroy_qmp() when destroying
> the monitor.
>
> Can need_resume be true in this case? If yes, is monitor_resume()
> still safe? We're in the middle of destroying the monitor...
I thought so when first reading through it, but on second though, we
should probably avoid this for sanity's sake.
Maybe with a flag, or an extra parameter.
Or we could introduce a "bool queue_filled" we set in handle_qmp_command()
instead of "calculating" `need_resume` in 2 places and unset it in
`monitor_data_destroy()` before clearing the queue?