[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[PATCH v2 0/6] migration/block: disk activation rewrite
From: |
Peter Xu |
Subject: |
[PATCH v2 0/6] migration/block: disk activation rewrite |
Date: |
Fri, 6 Dec 2024 18:08:32 -0500 |
CI: https://gitlab.com/peterx/qemu/-/pipelines/1577280033
(note: it's a pipeline of two patchsets, to save CI credits and time)
v1: 20241204005138.702289-1-peterx@redhat.com">https://lore.kernel.org/r/20241204005138.702289-1-peterx@redhat.com
This is v2 of the series, removing RFC tag, because my goal is to have them
(or some newer version) merged.
The major change is I merged last three patches, and did quite some changes
here and there, to make sure the global disk activation status is always
consistent. The whole idea is still the same. I say changelog won't help.
I also temporarily dropped Fabiano's ping-pong test cases to avoid
different versions floating on the list (as I know a new version is coming
at some point. Fabiano: you're taking over the 10.0 pulls, so I assume
you're aware so there's no concern on order of merges). I'll review the
test cases separately when they're ready, but this series is still tested
with that pingpong test and it keeps working.
I started looking at this problem as a whole when reviewing Fabiano's
series, especially the patch (for a QEMU crash [1]):
https://lore.kernel.org/r/20241125144612.16194-5-farosas@suse.de
The proposed patch could work, but it's unwanted to add such side effect to
migration. So I start to think about whether we can provide a cleaner
approach, because migration doesn't need the disks to be active to work at
all. Hence we should try to avoid adding a migration ABI (which doesn't
matter now, but may matter some day) into prepare phase on disk activation
status. Migration should happen with disks inactivated.
It's also a pure wish that, if bdrv_inactivate_all() could be benign to be
called even if all disks are already inactive. Then the bug is also gone.
After all, similar call on bdrv_activate_all() upon all-active disks is all
fine. I hope that wish could still be fair. But I don't know well on
block layer to say anything meaningful.
And when I was looking at that, I found more things spread all over the
place on disk activation. I decided to clean all of them up, while
hopefully fixing the QEMU crash [1] too.
For this v2, I did some more tests, I want to make sure all the past paths
keep working at least on failure or cancel races, also in postcopy failure
cases. So I did below and they all run pass (when I said "emulated" below,
I meant I hacked something to trigger those race / rare failures, because
they aren't easy to trigger with vanilla binary):
- Tested generic migrate_cancel during precopy, disk activation won't be
affected. Disk status reports correct values in tracepoints.
- Test Fabiano's ping-pong migration tests on PAUSED state VM.
- Emulated precopy failure before sending non-iterable, disk inactivation
won't happen, and also activation won't trigger after migration cleanups
(even if activation on top of activate disk is benign, I checked traces
to make sure it'll provide consistent disk status, skipping activation).
- Emulated precopy failure right after sending non-iterable. Disks will be
inactivated, but then can be reactivated properly before VM starts.
- Emulated postcopy failure when sending the packed data (which is after
disk invalidated), and making sure src VM will get back to live properly,
re-activate the disks before starting.
- Emulated concurrent migrate_cancel at the end of migration_completion()
of precopy, after disk inactivated. Disks can be reactivated properly.
NOTE: here if dest QEMU didn't quit before migrate_cancel,
bdrv_activate_all() can crash src QEMU. This behavior should be the same
before/after this patch.
Comments welcomed, thanks.
[1] https://gitlab.com/qemu-project/qemu/-/issues/2395
Peter Xu (6):
migration: Add helper to get target runstate
qmp/cont: Only activate disks if migration completed
migration/block: Make late-block-active the default
migration/block: Apply late-block-active behavior to postcopy
migration/block: Fix possible race with block_inactive
migration/block: Rewrite disk activation
include/migration/misc.h | 4 ++
migration/migration.h | 6 +-
migration/block-active.c | 94 +++++++++++++++++++++++++++
migration/colo.c | 2 +-
migration/migration.c | 136 +++++++++++++++------------------------
migration/savevm.c | 46 ++++++-------
monitor/qmp-cmds.c | 22 +++----
migration/meson.build | 1 +
migration/trace-events | 3 +
9 files changed, 188 insertions(+), 126 deletions(-)
create mode 100644 migration/block-active.c
--
2.47.0
- [PATCH v2 0/6] migration/block: disk activation rewrite,
Peter Xu <=
- [PATCH v2 1/6] migration: Add helper to get target runstate, Peter Xu, 2024/12/06
- [PATCH v2 2/6] qmp/cont: Only activate disks if migration completed, Peter Xu, 2024/12/06
- [PATCH v2 3/6] migration/block: Make late-block-active the default, Peter Xu, 2024/12/06
- [PATCH v2 4/6] migration/block: Apply late-block-active behavior to postcopy, Peter Xu, 2024/12/06
- [PATCH v2 5/6] migration/block: Fix possible race with block_inactive, Peter Xu, 2024/12/06
- [PATCH v2 6/6] migration/block: Rewrite disk activation, Peter Xu, 2024/12/06