From: Ivan Ren
Subject: Re: [Qemu-devel] [PATCH 3/3] migration: fix migrate_cancel multifd migration leads destination hung forever
Date: Wed, 24 Jul 2019 19:30:15 +0800
> If we just post it there, we get out of the wait (that bit is ok), but
> then we go back to the beginning of the loop, we (probably) get an
> error from qio_channel_read_all_eof(), and we end up in
> multifd_recv_terminate_threads(), or waiting there again.
>
> I think that it is better to *also* set an p->quit variable there, and
> not even try to receive anything for that channel?
>
> I will send a patch later.
Yes, agreed.
Thanks for the review.
On Wed, Jul 24, 2019 at 5:01 PM Juan Quintela <address@hidden> wrote:
> Ivan Ren <address@hidden> wrote:
> > When we migrate_cancel a multifd migration, a run sequence like this:
> >
> > [source]                          [destination]
> >
> > multifd_send_sync_main [finish]
> >                                   multifd_recv_thread waits on &p->sem_sync
> > shutdown to_dst_file
> >                                   detect error on from_src_file
> > send RAM_SAVE_FLAG_EOS [fail]     [no chance to run multifd_recv_sync_main]
> >                                   multifd_load_cleanup
> >                                   join multifd receive thread forever
> >
> > will leave the destination qemu hung at the following stack:
> >
> > pthread_join
> > qemu_thread_join
> > multifd_load_cleanup
> > process_incoming_migration_co
> > coroutine_trampoline
> >
> > Signed-off-by: Ivan Ren <address@hidden>
>
> I think this one is not enough. We need to set some error code, or
> disable the running bit at that point.
>
> > ---
> > migration/ram.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index e4eb9c441f..504c8ccb03 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1291,6 +1291,11 @@ int multifd_load_cleanup(Error **errp)
> >          MultiFDRecvParams *p = &multifd_recv_state->params[i];
> >
> >          if (p->running) {
> > +            /*
> > +             * multifd_recv_thread may be hung at the MULTIFD_FLAG_SYNC
> > +             * handling code; waking it up here in the cleanup phase
> > +             * does no harm.
> > +             */
> > +            qemu_sem_post(&p->sem_sync);
> >              qemu_thread_join(&p->thread);
> >          }
> >          object_unref(OBJECT(p->c));
>
> Let's see where we wait for p->sem_sync:
>
>
> static void *multifd_recv_thread(void *opaque)
> {
>     ....
>     while (true) {
>         uint32_t used;
>         uint32_t flags;
>
>         ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
>                                        p->packet_len, &local_err);
>         .....
>
>         if (flags & MULTIFD_FLAG_SYNC) {
>             qemu_sem_post(&multifd_recv_state->sem_sync);
>             qemu_sem_wait(&p->sem_sync);
>         }
>     }
>     if (local_err) {
>         multifd_recv_terminate_threads(local_err);
>     }
>     qemu_mutex_lock(&p->mutex);
>     p->running = false;
>     qemu_mutex_unlock(&p->mutex);
>
>     rcu_unregister_thread();
>     trace_multifd_recv_thread_end(p->id, p->num_packets, p->num_pages);
>
>     return NULL;
> }
>
> If we just post it there, we get out of the wait (that bit is ok), but
> then we go back to the beginning of the loop, we (probably) get an
> error from qio_channel_read_all_eof(), and we end up in
> multifd_recv_terminate_threads(), or waiting there again.
>
> I think that it is better to *also* set an p->quit variable there, and
> not even try to receive anything for that channel?
>
> I will send a patch later.
>
> Good catch.
>
> Later, Juan.
>