[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during re
From: |
Fabiano Rosas |
Subject: |
Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery |
Date: |
Mon, 11 Sep 2023 21:31:51 -0300 |
Peter Xu <peterx@redhat.com> writes:
Hi, sorry it took me so long to get to this.
> Normally the postcopy recover phase should only exist for a super short
> period, that's the duration when QEMU is trying to recover from an
> interrupted postcopy migration, during which handshake will be carried out
> for continuing the procedure with state changes from PAUSED -> RECOVER ->
> POSTCOPY_ACTIVE again.
>
> Here RECOVER phase should be super small, that happens right after the
> admin specified a new but working network link for QEMU to reconnect to
> dest QEMU.
>
> However there can still be case where the channel is broken in this small
> RECOVER window.
>
> If it happens, with current code there's no way the src QEMU can got kicked
> out of RECOVER stage. No way either to retry the recover in another channel
> when established.
>
> This patch allows the RECOVER phase to fail itself too - we're mostly
> ready, just some small things missing, e.g. properly kick the main
> migration thread out when sleeping on rp_sem when we found that we're at
> RECOVER stage. When this happens, it fails the RECOVER itself, and
> rollback to PAUSED stage. Then the user can retry another round of
> recovery.
>
> To make it even stronger, teach QMP command migrate-pause to explicitly
> kick src/dst QEMU out when needed, so even if for some reason the migration
> thread didn't got kicked out already by a failing rethrn-path thread, the
> admin can also kick it out.
>
> This will be an super, super corner case, but still try to cover that.
It would be nice to have a test for this. Being such a corner case, it
will be hard to keep this scenario working.
I wrote two tests[1] that do the recovery each using a different URI:
1) fd: using a freshly opened file,
2) fd: using a socketpair that simply has nothing on the other end.
These might be too far from the original bug, but it seems to exercise
some of the same paths:
Scenario 1:
/x86_64/migration/postcopy/recovery/fail-twice
the stacks are:
Thread 8 (Thread 0x7fffd5ffe700 (LWP 30282) "live_migration"):
qemu_sem_wait
ram_dirty_bitmap_sync_all
ram_resume_prepare
qemu_savevm_state_resume_prepare
postcopy_do_resume
postcopy_pause
migration_detect_error
migration_thread
Thread 7 (Thread 0x7fffd67ff700 (LWP 30281) "return path"):
qemu_sem_wait
postcopy_pause_return_path_thread
source_return_path_thread
This patch seems to fix it, although we cannot call qmp_migrate_recover
a second time because the mis state is now in RECOVER:
"Migrate recover can only be run when postcopy is paused."
Do we maybe need to return the state to PAUSED, or allow
qmp_migrate_recover to run in RECOVER, like you did on the src side?
Scenario 2:
/x86_64/migration/postcopy/recovery/fail-twice/rp
Thread 8 (Thread 0x7fffd5ffe700 (LWP 30456) "live_migration"):
qemu_sem_wait
ram_dirty_bitmap_sync_all
ram_resume_prepare
qemu_savevm_state_resume_prepare
postcopy_do_resume
postcopy_pause
migration_detect_error
migration_thread
Thread 7 (Thread 0x7fffd67ff700 (LWP 30455) "return path"):
recvmsg
qio_channel_socket_readv
qio_channel_readv_full
qio_channel_read
qemu_fill_buffer
qemu_peek_byte
qemu_get_byte
qemu_get_be16
source_return_path_thread
Here, with this patch the migration gets stuck unless we call
migrate_pause() one more time. After another round of migrate_pause +
recover, it finishes properly.
1- hacked-together test:
-->8--
>From a34685c8795799350665a880cd2ddddbf53c5812 Mon Sep 17 00:00:00 2001
From: Fabiano Rosas <farosas@suse.de>
Date: Mon, 11 Sep 2023 20:45:33 -0300
Subject: [PATCH] test patch
---
tests/qtest/migration-test.c | 87 ++++++++++++++++++++++++++++++++++++
1 file changed, 87 insertions(+)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 1b43df5ca7..4d9d2209c1 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -695,6 +695,7 @@ typedef struct {
/* Postcopy specific fields */
void *postcopy_data;
bool postcopy_preempt;
+ int postcopy_recovery_method;
} MigrateCommon;
static int test_migrate_start(QTestState **from, QTestState **to,
@@ -1357,6 +1358,61 @@ static void test_postcopy_preempt_tls_psk(void)
}
#endif
+static void postcopy_recover_fail(QTestState *from, QTestState *to, int method)
+{
+ int src, dst;
+
+ if (method == 1) {
+ /* give it some random fd to recover */
+ g_autofree char *uri = g_strdup_printf("%s/noop", tmpfs);
+ src = dst = open(uri, O_CREAT|O_RDWR);
+ } else if (method == 2) {
+ int ret, pair1[2], pair2[2];
+
+ /* create two unrelated socketpairs */
+ ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair1);
+ g_assert_cmpint(ret, ==, 0);
+
+ ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair2);
+ g_assert_cmpint(ret, ==, 0);
+
+ /* give the guests unpaired ends of the sockets */
+ src = pair1[0];
+ dst = pair2[0];
+ }
+
+ qtest_qmp_fds_assert_success(to, &src, 1,
+ "{ 'execute': 'getfd',"
+ " 'arguments': { 'fdname': 'fd-mig' }}");
+
+ qtest_qmp_fds_assert_success(from, &dst, 1,
+ "{ 'execute': 'getfd',"
+ " 'arguments': { 'fdname': 'fd-mig' }}");
+
+ migrate_recover(to, "fd:fd-mig");
+
+ wait_for_migration_status(from, "postcopy-paused",
+ (const char * []) { "failed", "active",
+ "completed", NULL });
+ migrate_qmp(from, "fd:fd-mig", "{'resume': true}");
+
+ printf("WAIT\n");
+ if (method == 2) {
+ /* This would be issued by the admin upon noticing the hang */
+ migrate_pause(from);
+ }
+
+ wait_for_migration_status(from, "postcopy-paused",
+ (const char * []) { "failed", "active",
+ "completed", NULL });
+ printf("PAUSED\n");
+
+ close(src);
+ if (method == 2) {
+ close(dst);
+ }
+}
+
static void test_postcopy_recovery_common(MigrateCommon *args)
{
QTestState *from, *to;
@@ -1396,6 +1452,13 @@ static void test_postcopy_recovery_common(MigrateCommon
*args)
(const char * []) { "failed", "active",
"completed", NULL });
+ if (args->postcopy_recovery_method) {
+ /* fail the recovery */
+ postcopy_recover_fail(from, to, args->postcopy_recovery_method);
+
+ /* continue with a good recovery */
+ }
+
/*
* Create a new socket to emulate a new channel that is different
* from the broken migration channel; tell the destination to
@@ -1435,6 +1498,24 @@ static void test_postcopy_recovery_compress(void)
test_postcopy_recovery_common(&args);
}
+static void test_postcopy_recovery_fail(void)
+{
+ MigrateCommon args = {
+ .postcopy_recovery_method = 1,
+ };
+
+ test_postcopy_recovery_common(&args);
+}
+
+static void test_postcopy_recovery_fail_rp(void)
+{
+ MigrateCommon args = {
+ .postcopy_recovery_method = 2,
+ };
+
+ test_postcopy_recovery_common(&args);
+}
+
#ifdef CONFIG_GNUTLS
static void test_postcopy_recovery_tls_psk(void)
{
@@ -2825,6 +2906,12 @@ int main(int argc, char **argv)
qtest_add_func("/migration/postcopy/recovery/compress/plain",
test_postcopy_recovery_compress);
}
+ qtest_add_func("/migration/postcopy/recovery/fail-twice",
+ test_postcopy_recovery_fail);
+
+ qtest_add_func("/migration/postcopy/recovery/fail-twice/rp",
+ test_postcopy_recovery_fail_rp);
+
}
qtest_add_func("/migration/bad_dest", test_baddest);
--
2.35.3
- Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery,
Fabiano Rosas <=