qemu-devel

Re: QEMU migration-test CI intermittent failure


From: Peter Xu
Subject: Re: QEMU migration-test CI intermittent failure
Date: Fri, 15 Sep 2023 12:28:36 -0400

On Thu, Sep 14, 2023 at 10:56:23PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, Sep 14, 2023 at 07:54:17PM -0300, Fabiano Rosas wrote:
> >> Fabiano Rosas <farosas@suse.de> writes:
> >> 
> >> > Peter Xu <peterx@redhat.com> writes:
> >> >
> >> >> On Thu, Sep 14, 2023 at 12:57:08PM -0300, Fabiano Rosas wrote:
> >> >>> I managed to reproduce it. It's not the return path error. In hindsight
> >> >>> that's obvious because that error happens in the 'recovery' test and this
> >> >>> one in the 'plain' one. Sorry about the noise.
> >> >>
> >> >> No worry.  It's good to finally identify that.
> >> >>
> >> >>> 
> >> >>> This one reproduced with just 4 iterations of preempt/plain. I'll
> >> >>> investigate.
> >> >
> >> > It seems that we're getting a TCP disconnect (ECONNRESET) when doing
> >> > that shutdown() on postcopy_qemufile_src. The one from commit 6621883f93
> >> > ("migration: Fix potential race on postcopy_qemufile_src").
> >> >
> >> > I'm trying to determine why that happens when other times it just
> >> > returns 0 as expected.
> >> >
> >> > Could this mean that we're kicking the dest too soon while it is still
> >> > receiving valid data?
> >> 
> >> Looking a bit more into this, what's happening is that
> >> postcopy_ram_incoming_cleanup() is shutting down postcopy_qemufile_dst
> >> while ram_load_postcopy() is still running.
> >> 
> >> The postcopy_ram_listen_thread() function waits for the
> >> main_thread_load_event, but that only works when not using preempt. With
> >> the preempt thread, the event is set right away and we proceed to do the
> >> cleanup without waiting.
> >> 
> >> So the assumption of commit 6621883f93 that the incoming side knows when
> >> it has finished migrating is wrong IMO. Without the EOS we're relying on
> >> the chance that the shutdown() happens after the last recvmsg has
> >> returned and not during it.
> >> 
> >> Peter, what do you think?
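
(To make the race concrete, below is a standalone model of the sequence
described above, using plain POSIX threads and a pipe standing in for the
preempt channel. It is only a sketch, not the actual QEMU code; in the real
case the shutdown lands on a TCP socket while recvmsg() is pending, which is
where the ECONNRESET comes from.)

/* Standalone model of the premature-shutdown race (not QEMU code). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int channel[2];   /* stands in for postcopy_qemufile_dst */

static void *preempt_thread(void *arg)
{
    char page[16];
    /* Stands in for ram_load_postcopy(): still waiting for the last page. */
    ssize_t n = read(channel[0], page, sizeof(page));
    if (n <= 0) {
        printf("preempt thread: channel closed before the last page arrived\n");
    } else {
        printf("preempt thread: received last page (%zd bytes)\n", n);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;

    if (pipe(channel) != 0) {
        return 1;
    }
    pthread_create(&t, NULL, preempt_thread, NULL);

    /*
     * Stands in for postcopy_ram_incoming_cleanup(): with preempt enabled,
     * main_thread_load_event is already set, so nothing waits for the
     * preempt thread before the channel is torn down.
     */
    usleep(10 * 1000);      /* give the reader time to block in read() */
    close(channel[1]);      /* premature shutdown */

    pthread_join(t, NULL);
    return 0;
}

Running this prints the "channel closed" message, the analogue of the
recvmsg() failing with ECONNRESET in the CI run.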
> >
> > That's a good point.
> >
> > One thing to verify that (sorry, I still cannot reproduce it myself, which
> > is so weird... it seems adding load won't really help reproduce this) is to
> > let the main thread wait for all requested pages to arrive:
> >
> > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> > index 29aea9456d..df055c51ea 100644
> > --- a/migration/postcopy-ram.c
> > +++ b/migration/postcopy-ram.c
> > @@ -597,6 +597,12 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
> >      trace_postcopy_ram_incoming_cleanup_entry();
> >  
> >      if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
> > +        /*
> > +         * NOTE!  It's possible that the preempt thread is still handling
> > +         * the last pages to arrive which were requested by faults.  Make
> > +         * sure nothing is left behind.
> > +         */
> > +        while (qatomic_read(&mis->page_requested_count));
> >          /* Notify the fast load thread to quit */
> >          mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
> >          if (mis->postcopy_qemufile_dst) {
> >
> > If that can work solidly, we can figure out a better way than a dead loop
> > here.
> 
> Yep, 2000+ iterations so far and no error.

Definitely good news.. thanks.

> 
> Should we add something that makes ram_load_postcopy return once it's
> finished? Then this code could just set PREEMPT_THREAD_QUIT and join the
> preempt thread.

Yeah it'll be nice to have that. We used to have an EOS message at the end
for the preempt channel, but that used to cause another race (which IIRC I
could easily reproduce at that time) and we removed it after
6621883f9398bc3 for 7.2+.

Now, instead of fiddling with the protocol again, we do have a way to detect
that, but it may need a small trick around the page_requested_count
accounting, plus a way for the preempt thread to signal the main thread.

I've drafted one solution and attached it here as a complete patch; feel free
to try it.  If you have anything better, that'd be even nicer; just let me
know your thoughts.

---8<---
From 384aff2264650872e15916dcfeaec593e5e9b781 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 15 Sep 2023 12:11:34 -0400
Subject: [PATCH] migration: Fix race that dest preempt thread close too early

We hit an intermittent CI failure in migration-test, on the preempt/plain
unit test:

qemu-system-x86_64: Unable to read from socket: Connection reset by peer
Memory content inconsistency at 5b43000 first_byte = bd last_byte = bc current = 4f hit_edge = 1
**
ERROR:../tests/qtest/migration-test.c:300:check_guests_ram: assertion failed: (bad == 0)
(test program exited with status code -6)

Fabiano debugged it and found that the preempt thread can quit even without
having received all the pages; the guest then misses those pages and its
memory is corrupted.

To make sure the preempt thread has finished receiving all the pages, we can
rely on page_requested_count dropping to zero, because the preempt channel
only receives requested page faults.  Note that not all faulted pages need to
be sent via the preempt channel/thread; when a requested page happens to be
queued on the background main channel already, the src qemu will still send
it via the background channel.

Here, instead of spinning on the count, we add a condvar so the main thread
can wait on it if that unusual case happens, without burning CPU for no good
reason.  The window is short, so spinning in this rare case would probably be
fine too; it's just cleaner not to.

The condvar is only used when that special case triggers.  Some memory
ordering is needed between the count and the preempt thread status field, so
that the main thread is guaranteed to get the wakeup whenever that case does
trigger.

Closes: https://gitlab.com/qemu-project/qemu/-/issues/1886
Debugged-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h    | 13 ++++++++++++-
 migration/migration.c    |  1 +
 migration/postcopy-ram.c | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/migration/migration.h b/migration/migration.h
index c390500604..cdaa10d515 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -196,7 +196,10 @@ struct MigrationIncomingState {
 
     /* A tree of pages that we requested to the source VM */
     GTree *page_requested;
-    /* For debugging purpose only, but would be nice to keep */
+    /*
+     * For postcopy only, count the number of requested page faults that
+     * still haven't been resolved.
+     */
     int page_requested_count;
     /*
      * The mutex helps to maintain the requested pages that we sent to the
@@ -210,6 +213,14 @@ struct MigrationIncomingState {
      * contains valid information.
      */
     QemuMutex page_request_mutex;
+    /*
+     * If postcopy preempt is enabled, there is a chance that the main
+     * thread finished loading its data before the preempt channel has
+     * finished loading the urgent pages.  If that happens, the two threads
+     * will use this condvar to synchronize, so the main thread will always
+     * wait until all pages are received.
+     */
+    QemuCond page_request_cond;
 
     /*
      * Number of devices that have yet to approve switchover. When this reaches
diff --git a/migration/migration.c b/migration/migration.c
index d61e572742..e86b3f7368 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -153,6 +153,7 @@ void migration_object_init(void)
     qemu_sem_init(&current_incoming->postcopy_qemufile_dst_done, 0);
 
     qemu_mutex_init(&current_incoming->page_request_mutex);
+    qemu_cond_init(&current_incoming->page_request_cond);
     current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
 
     migration_object_check(current_migration, &error_fatal);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 29aea9456d..dbf02cd3ed 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -599,6 +599,30 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
     if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
         /* Notify the fast load thread to quit */
         mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
+        /*
+         * Update preempt_thread_status before reading the count.  Note: the
+         * mutex lock below only provides ACQUIRE semantics, which doesn't
+         * stop this write from being reordered after the count is read.
+         */
+        smp_mb();
+        /*
+         * It's possible that the preempt thread is still handling the last
+         * pages to arrive, which were requested by guest page faults.  Make
+         * sure nothing is left behind by waiting on the condvar if that
+         * unlikely case happens.
+         */
+        WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
+            if (qatomic_read(&mis->page_requested_count)) {
+                /*
+                 * We are guaranteed to receive a signal later, because
+                 * count > 0 here and the preempt thread is bound to bring
+                 * it down to zero very soon.
+                 */
+                qemu_cond_wait(&mis->page_request_cond,
+                               &mis->page_request_mutex);
+            }
+        }
+        /* Notify the fast load thread to quit */
         if (mis->postcopy_qemufile_dst) {
             qemu_file_shutdown(mis->postcopy_qemufile_dst);
         }
@@ -1279,6 +1303,17 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
             g_tree_remove(mis->page_requested, host_addr);
             mis->page_requested_count--;
             trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
+            /* Order the update of count and read of preempt status */
+            smp_mb();
+            if (qatomic_read(&mis->preempt_thread_status) ==
+                PREEMPT_THREAD_QUIT) {
+                /*
+                 * This probably means the main thread is waiting for us.
+                 * Notify that we've finished receiving the last requested
+                 * page.
+                 */
+                qemu_cond_signal(&mis->page_request_cond);
+            }
         }
         qemu_mutex_unlock(&mis->page_request_mutex);
         mark_postcopy_blocktime_end((uintptr_t)host_addr);
-- 
2.41.0
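
(For reference, a standalone sketch of the same counter + condvar + barrier
pairing used in the patch above, written with C11 atomics and pthreads rather
than QEMU's qatomic/QemuCond primitives.  The names are illustrative
stand-ins, not QEMU's, and this is only a sketch of the pattern, not the
actual implementation.)

/* Standalone sketch of the drain-before-shutdown pattern (not QEMU code). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static atomic_int outstanding = 3;    /* plays the role of page_requested_count */
static atomic_bool quitting = false;  /* plays the role of PREEMPT_THREAD_QUIT */

/* Plays the role of the preempt thread resolving requested pages. */
static void *worker(void *arg)
{
    for (int i = 0; i < 3; i++) {
        pthread_mutex_lock(&lock);
        atomic_fetch_sub(&outstanding, 1);            /* one request resolved */
        atomic_thread_fence(memory_order_seq_cst);    /* order count write vs flag read */
        if (atomic_load(&quitting)) {
            /* Cleanup may already be waiting for the counter to drain. */
            pthread_cond_signal(&drained);
        }
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    /* Cleanup path: announce the quit first, then wait for the drain. */
    atomic_store(&quitting, true);
    atomic_thread_fence(memory_order_seq_cst);        /* order flag write vs count read */
    pthread_mutex_lock(&lock);
    while (atomic_load(&outstanding) > 0) {
        pthread_cond_wait(&drained, &lock);
    }
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    printf("all outstanding requests drained; safe to shut the channel down\n");
    return 0;
}

The paired full fences play the part of the two smp_mb() calls in the patch:
the waiter's flag write must not be reordered after its count read, and the
worker's count decrement must not be reordered after its flag read, so the
final wakeup can never be missed.  The sketch waits in a while loop (rather
than the patch's single check) simply to be robust against spurious wakeups.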



