Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during re

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during re

From:	Fabiano Rosas
Subject:	Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery
Date:	Mon, 11 Sep 2023 21:31:51 -0300

Peter Xu <peterx@redhat.com> writes:

Hi, sorry it took me so long to get to this.

> Normally the postcopy recover phase should only exist for a super short
> period, that's the duration when QEMU is trying to recover from an
> interrupted postcopy migration, during which handshake will be carried out
> for continuing the procedure with state changes from PAUSED -> RECOVER ->
> POSTCOPY_ACTIVE again.
>
> Here RECOVER phase should be super small, that happens right after the
> admin specified a new but working network link for QEMU to reconnect to
> dest QEMU.
>
> However there can still be case where the channel is broken in this small
> RECOVER window.
>
> If it happens, with current code there's no way the src QEMU can got kicked
> out of RECOVER stage. No way either to retry the recover in another channel
> when established.
>
> This patch allows the RECOVER phase to fail itself too - we're mostly
> ready, just some small things missing, e.g. properly kick the main
> migration thread out when sleeping on rp_sem when we found that we're at
> RECOVER stage.  When this happens, it fails the RECOVER itself, and
> rollback to PAUSED stage.  Then the user can retry another round of
> recovery.
>
> To make it even stronger, teach QMP command migrate-pause to explicitly
> kick src/dst QEMU out when needed, so even if for some reason the migration
> thread didn't got kicked out already by a failing rethrn-path thread, the
> admin can also kick it out.
>
> This will be an super, super corner case, but still try to cover that.

It would be nice to have a test for this. Being such a corner case, it
will be hard to keep this scenario working.

I wrote two tests[1] that do the recovery each using a different URI:
1) fd: using a freshly opened file,
2) fd: using a socketpair that simply has nothing on the other end.

These might be too far from the original bug, but it seems to exercise
some of the same paths:

Scenario 1:
/x86_64/migration/postcopy/recovery/fail-twice

the stacks are:

Thread 8 (Thread 0x7fffd5ffe700 (LWP 30282) "live_migration"):
 qemu_sem_wait
 ram_dirty_bitmap_sync_all
 ram_resume_prepare
 qemu_savevm_state_resume_prepare
 postcopy_do_resume
 postcopy_pause
 migration_detect_error
 migration_thread

Thread 7 (Thread 0x7fffd67ff700 (LWP 30281) "return path"):
 qemu_sem_wait
 postcopy_pause_return_path_thread
 source_return_path_thread

This patch seems to fix it, although we cannot call qmp_migrate_recover
a second time because the mis state is now in RECOVER:

  "Migrate recover can only be run when postcopy is paused."

Do we maybe need to return the state to PAUSED, or allow
qmp_migrate_recover to run in RECOVER, like you did on the src side?


Scenario 2:
/x86_64/migration/postcopy/recovery/fail-twice/rp

Thread 8 (Thread 0x7fffd5ffe700 (LWP 30456) "live_migration"):
 qemu_sem_wait
 ram_dirty_bitmap_sync_all
 ram_resume_prepare
 qemu_savevm_state_resume_prepare
 postcopy_do_resume
 postcopy_pause
 migration_detect_error
 migration_thread

Thread 7 (Thread 0x7fffd67ff700 (LWP 30455) "return path"):
 recvmsg
 qio_channel_socket_readv
 qio_channel_readv_full
 qio_channel_read
 qemu_fill_buffer
 qemu_peek_byte
 qemu_get_byte
 qemu_get_be16
 source_return_path_thread

Here, with this patch the migration gets stuck unless we call
migrate_pause() one more time. After another round of migrate_pause +
recover, it finishes properly.


1- hacked-together test:
-->8--
>From a34685c8795799350665a880cd2ddddbf53c5812 Mon Sep 17 00:00:00 2001
From: Fabiano Rosas <farosas@suse.de>
Date: Mon, 11 Sep 2023 20:45:33 -0300
Subject: [PATCH] test patch

---
 tests/qtest/migration-test.c | 87 ++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 1b43df5ca7..4d9d2209c1 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -695,6 +695,7 @@ typedef struct {
     /* Postcopy specific fields */
     void *postcopy_data;
     bool postcopy_preempt;
+    int postcopy_recovery_method;
 } MigrateCommon;
 
 static int test_migrate_start(QTestState **from, QTestState **to,
@@ -1357,6 +1358,61 @@ static void test_postcopy_preempt_tls_psk(void)
 }
 #endif
 
+static void postcopy_recover_fail(QTestState *from, QTestState *to, int method)
+{
+    int src, dst;
+
+    if (method == 1) {
+        /* give it some random fd to recover */
+        g_autofree char *uri = g_strdup_printf("%s/noop", tmpfs);
+        src = dst = open(uri, O_CREAT|O_RDWR);
+    } else if (method == 2) {
+        int ret, pair1[2], pair2[2];
+
+        /* create two unrelated socketpairs */
+        ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair1);
+        g_assert_cmpint(ret, ==, 0);
+
+        ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair2);
+        g_assert_cmpint(ret, ==, 0);
+
+        /* give the guests unpaired ends of the sockets */
+        src = pair1[0];
+        dst = pair2[0];
+    }
+
+    qtest_qmp_fds_assert_success(to, &src, 1,
+                                 "{ 'execute': 'getfd',"
+                                 "  'arguments': { 'fdname': 'fd-mig' }}");
+
+    qtest_qmp_fds_assert_success(from, &dst, 1,
+                                 "{ 'execute': 'getfd',"
+                                 "  'arguments': { 'fdname': 'fd-mig' }}");
+
+    migrate_recover(to, "fd:fd-mig");
+
+    wait_for_migration_status(from, "postcopy-paused",
+                              (const char * []) { "failed", "active",
+                                  "completed", NULL });
+    migrate_qmp(from, "fd:fd-mig", "{'resume': true}");
+
+    printf("WAIT\n");
+    if (method == 2) {
+        /* This would be issued by the admin upon noticing the hang */
+        migrate_pause(from);
+    }
+
+    wait_for_migration_status(from, "postcopy-paused",
+                              (const char * []) { "failed", "active",
+                                  "completed", NULL });
+    printf("PAUSED\n");
+
+    close(src);
+    if (method == 2) {
+        close(dst);
+    }
+}
+
 static void test_postcopy_recovery_common(MigrateCommon *args)
 {
     QTestState *from, *to;
@@ -1396,6 +1452,13 @@ static void test_postcopy_recovery_common(MigrateCommon 
*args)
                               (const char * []) { "failed", "active",
                                                   "completed", NULL });
 
+    if (args->postcopy_recovery_method) {
+        /* fail the recovery */
+        postcopy_recover_fail(from, to, args->postcopy_recovery_method);
+
+        /* continue with a good recovery */
+    }
+
     /*
      * Create a new socket to emulate a new channel that is different
      * from the broken migration channel; tell the destination to
@@ -1435,6 +1498,24 @@ static void test_postcopy_recovery_compress(void)
     test_postcopy_recovery_common(&args);
 }
 
+static void test_postcopy_recovery_fail(void)
+{
+    MigrateCommon args = {
+        .postcopy_recovery_method = 1,
+    };
+
+    test_postcopy_recovery_common(&args);
+}
+
+static void test_postcopy_recovery_fail_rp(void)
+{
+    MigrateCommon args = {
+        .postcopy_recovery_method = 2,
+    };
+
+    test_postcopy_recovery_common(&args);
+}
+
 #ifdef CONFIG_GNUTLS
 static void test_postcopy_recovery_tls_psk(void)
 {
@@ -2825,6 +2906,12 @@ int main(int argc, char **argv)
             qtest_add_func("/migration/postcopy/recovery/compress/plain",
                            test_postcopy_recovery_compress);
         }
+        qtest_add_func("/migration/postcopy/recovery/fail-twice",
+                       test_postcopy_recovery_fail);
+
+        qtest_add_func("/migration/postcopy/recovery/fail-twice/rp",
+                       test_postcopy_recovery_fail_rp);
+
     }
 
     qtest_add_func("/migration/bad_dest", test_baddest);
-- 
2.35.3

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery, Fabiano Rosas <=
- Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery, Peter Xu, 2023/09/12
  - Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery, Peter Xu, 2023/09/12
  - Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery, Fabiano Rosas, 2023/09/12
    - Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery, Peter Xu, 2023/09/12

Prev by Date: Re: [PATCH v3 23/23] bsd-user: Add stubs for vadvise(), sbrk() and sstk()
Next by Date: Re: [PATCH 8/9] migration: Add migration_rp_wait|kick()
Previous by thread: Re: [PATCH 06/32] bsd-user: Add bsd-proc.c to meson.build
Next by thread: Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery
Index(es):
- Date
- Thread