|
From: | Avihai Horon |
Subject: | Re: [PATCH v3 6/7] vfio/migration: Add VFIO migration pre-copy support |
Date: | Wed, 24 May 2023 15:49:16 +0300 |
User-agent: | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0 |
On 23/05/2023 17:56, Cédric Le Goater wrote:
External email: Use caution opening links or attachments Hello Avihai, On 5/21/23 17:18, Avihai Horon wrote:Pre-copy support allows the VFIO device data to be transferred while the VM is running. This helps to accommodate VFIO devices that have a large amount of data that needs to be transferred, and it can reduce migration downtime. Pre-copy support is optional in VFIO migration protocol v2. Implement pre-copy of VFIO migration protocol v2 and use it for devices that support it. Full description of it can be found here [1]. In addition, add a new VFIO device property x-allow-pre-copy to keep migration compatibility to/from older QEMU versions that don't have VFIO pre-copy support. [1] https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/May be simply reference Linux commit 4db52602a607 ("vfio: Extend the devicemigration protocol with PRE_COPY") instead.
Sure, I will change it.
some comments below,Signed-off-by: Avihai Horon <avihaih@nvidia.com> --- docs/devel/vfio-migration.rst | 35 +++++--- include/hw/vfio/vfio-common.h | 4 + hw/core/machine.c | 1 + hw/vfio/common.c | 6 +- hw/vfio/migration.c | 163 ++++++++++++++++++++++++++++++++-- hw/vfio/pci.c | 2 + hw/vfio/trace-events | 4 +- 7 files changed, 193 insertions(+), 22 deletions(-)diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rstindex 1b68ccf115..e896b2a673 100644 --- a/docs/devel/vfio-migration.rst +++ b/docs/devel/vfio-migration.rst@@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the destination host. This document details how saving and restoring of VFIOdevices is done in QEMU.-Migration of VFIO devices currently consists of a single stop-and-copy phase. -During the stop-and-copy phase the guest is stopped and the entire VFIO device-data is transferred to the destination. --The pre-copy phase of migration is currently not supported for VFIO devices.-Support for VFIO pre-copy will be added later on.+Migration of VFIO devices consists of two phases: the optional pre-copy phase, +and the stop-and-copy phase. The pre-copy phase is iterative and allows to +accommodate VFIO devices that have a large amount of data that needs to be +transferred. The iterative pre-copy phase of migration allows for the guest to +continue whilst the VFIO device state is transferred to the destination, this +helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy+support by reporting the VFIO_MIGRATION_PRE_COPY flag in the +VFIO_DEVICE_FEATURE_MIGRATION ioctl.Note that currently VFIO migration is supported only for a single device. This is due to VFIO migration's lack of P2P support. However, P2P support is planned @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows: * A ``load_setup`` function that sets the VFIO device on the destination in_RESUMING state. +* A ``state_pending_estimate`` function that reports an estimate of the+ remaining pre-copy data that the vendor driver has yet to save for the VFIO+ device. +* A ``state_pending_exact`` function that reads pending_bytes from the vendor driver, which indicates the amount of data that the vendor driver has yet tosave for the VFIO device.+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is+ active only when the VFIO device is in pre-copy states. ++* A ``save_live_iterate`` function that reads the VFIO device's data from the+ vendor driver during iterative pre-copy phase. +* A ``save_state`` function to save the device config space if it is present.* A ``save_live_complete_precopy`` function that sets the VFIO device in@@ -111,8 +123,10 @@ Flow of state changes during Live migration =========================================== Below is the flow of state change during live migration.-The values in the brackets represent the VM state, the migration state, and +The values in the parentheses represent the VM state, the migration state, andthe VFIO device state, respectively.+The text in the square brackets represents the flow if the VFIO device supports+pre-copy. Live migration save path ------------------------ @@ -124,11 +138,12 @@ Live migration save path | migrate_init spawns migration_threadMigration thread then calls each device's .save_setup()- (RUNNING, _SETUP, _RUNNING) + (RUNNING, _SETUP, _RUNNING [_PRE_COPY]) | - (RUNNING, _ACTIVE, _RUNNING)- If device is active, get pending_bytes by .state_pending_exact()+ (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])+ If device is active, get pending_bytes by .state_pending_{estimate,exact}() If total pending_bytes >= threshold_size, call .save_live_iterate()+ [Data of VFIO device for pre-copy phase is copied]Iterate till total pending bytes converge and are less than threshold|On migration completion, vCPU stops and calls .save_live_complete_precopy for diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.hindex eed244f25f..5ce7a01d56 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -66,6 +66,9 @@ typedef struct VFIOMigration { int data_fd; void *data_buffer; size_t data_buffer_size; + uint64_t precopy_init_size; + uint64_t precopy_dirty_size; + uint64_t mig_flags;It would have been cleaner to introduce VFIOMigration::mig_flags and its update in another patch. This is minor.
Sure, I will split it.
} VFIOMigration; typedef struct VFIOAddressSpace { @@ -143,6 +146,7 @@ typedef struct VFIODevice { VFIOMigration *migration; Error *migration_blocker; OnOffAuto pre_copy_dirty_page_tracking; + bool allow_pre_copy;same comment for this bool and the associated property, because it would ease backports.
Sure. Just for general knowledge, can you explain how this could ease backports?
bool dirty_pages_supported; bool dirty_tracking; } VFIODevice; diff --git a/hw/core/machine.c b/hw/core/machine.c index 07f763eb2e..50439e5cbb 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -41,6 +41,7 @@ GlobalProperty hw_compat_8_0[] = { { "migration", "multifd-flush-after-each-section", "on"}, + { "vfio-pci", "x-allow-pre-copy", "false" }, }; const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0); diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 78358ede27..b73086e17a 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c@@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)}if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&- migration->device_state == VFIO_DEVICE_STATE_RUNNING) {+ (migration->device_state == VFIO_DEVICE_STATE_RUNNING || + migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {return false; } }@@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)return false; } - if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) { + if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||+ migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {continue; } else { return false; diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 235978fd68..418efed019 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c@@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)return "STOP_COPY"; case VFIO_DEVICE_STATE_RESUMING: return "RESUMING"; + case VFIO_DEVICE_STATE_PRE_COPY: + return "PRE_COPY"; default: return "UNKNOWN STATE"; }@@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,return 0; } +static int vfio_query_precopy_size(VFIOMigration *migration) +{ + struct vfio_precopy_info precopy = { + .argsz = sizeof(precopy), + };May be move here : migration->precopy_init_size = 0; migration->precopy_dirty_size = 0; since the values are reset always before calling vfio_query_precopy_size()
OK.I will also reset these values in vfio_save_cleanup() so there won't be stale values in case migration is cancelled or fails.
++ if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {+ return -errno; + } + + migration->precopy_init_size = precopy.initial_bytes; + migration->precopy_dirty_size = precopy.dirty_bytes; + + return 0; +} + /* Returns the size of saved data on success and -errno on error */ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration) {@@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)data_size = read(migration->data_fd, migration->data_buffer, migration->data_buffer_size); if (data_size < 0) { + /* Pre-copy emptied all the device state for now */ + if (errno == ENOMSG) {Could you explain a little more this errno please ? It looks like an API withthe VFIO PCI variant kernel driver.
Yes, it's explained in the precopy uAPI [1]. Do you want to change the comment to something like the following? /* * ENOMSG indicates that the migration data_fd has reached a temporary * "end of stream", i.e. both initial_bytes and dirty_bytes are zero. * More data may be available later in future reads. */[1] https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vfio.h#L1084
+ return 0; + } + return -errno; } if (data_size == 0) {@@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)return qemu_file_get_error(f) ?: data_size; }+static void vfio_update_estimated_pending_data(VFIOMigration *migration,+ uint64_t data_size) +{ + if (!data_size) { + /*+ * Pre-copy emptied all the device state for now, update estimated sizes+ * accordingly. + */ + migration->precopy_init_size = 0; + migration->precopy_dirty_size = 0; + + return; + } + + if (migration->precopy_init_size) {+ uint64_t init_size = MIN(migration->precopy_init_size, data_size);+ + migration->precopy_init_size -= init_size; + data_size -= init_size; + } + + migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size, + data_size);Do we have a trace event for all this data values ?+} + +static bool vfio_precopy_supported(VFIODevice *vbasedev) +{ + VFIOMigration *migration = vbasedev->migration; + + return vbasedev->allow_pre_copy && + migration->mig_flags & VFIO_MIGRATION_PRE_COPY; +} +/* ---------------------------------------------------------------------- */static int vfio_save_setup(QEMUFile *f, void *opaque)@@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)return -ENOMEM; } + if (vfio_precopy_supported(vbasedev)) { + int ret; + + migration->precopy_init_size = 0; + migration->precopy_dirty_size = 0; + + switch (migration->device_state) { + case VFIO_DEVICE_STATE_RUNNING:+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,+ VFIO_DEVICE_STATE_RUNNING); + if (ret) { + return ret; + } + + vfio_query_precopy_size(migration); + + break; + case VFIO_DEVICE_STATE_STOP: + /* vfio_save_complete_precopy() will go to STOP_COPY */ + break; + default: + return -EINVAL; + } + } +trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); @@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque) trace_vfio_save_cleanup(vbasedev->name); }+static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,+ uint64_t *can_postcopy) +{ + VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; + + if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) { + return; + } + + *must_precopy += + migration->precopy_init_size + migration->precopy_dirty_size; + + trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy, + *can_postcopy, + migration->precopy_init_size, + migration->precopy_dirty_size);ok we have one :) I wonder if we should not update trace_vfio_save_iterate()also with some values.
Hmm, yes, I guess it wouldn't hurt.
+} + /** Migration size of VFIO devices can be as little as a few KBs or as big as* many GBs. This value should be big enough to cover the worst case. */ #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) -/*- * Only exact function is implemented and not estimate function. The reason is - * that during pre-copy phase of migration the estimate function is called - * repeatedly while pending RAM size is over the threshold, thus migration - * can't converge and querying the VFIO device pending data size is useless.- */static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,uint64_t *can_postcopy) { VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; /*@@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,vfio_query_stop_copy_size(vbasedev, &stop_copy_size); *must_precopy += stop_copy_size; + if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) { + migration->precopy_init_size = 0; + migration->precopy_dirty_size = 0; + vfio_query_precopy_size(migration); + + *must_precopy +=+ migration->precopy_init_size + migration->precopy_dirty_size;+ } +trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,- stop_copy_size);+ stop_copy_size, migration->precopy_init_size,+ migration->precopy_dirty_size); +} + +static bool vfio_is_active_iterate(void *opaque) +{ + VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; + + return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY; +} + +static int vfio_save_iterate(QEMUFile *f, void *opaque) +{ + VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; + ssize_t data_size; + + data_size = vfio_save_block(f, migration); + if (data_size < 0) { + return data_size; + } + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); + + vfio_update_estimated_pending_data(migration, data_size); + + trace_vfio_save_iterate(vbasedev->name); + + /*+ * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.+ * Return 1 so following handlers will not be potentially blocked.Can this condition be detected to warn the user ?
I don't think so, it depends on the kernel driver implementation. Thanks!
+ */ + return 1; } static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)@@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)ssize_t data_size; int ret; - /* We reach here with device state STOP only */ + /* We reach here with device state STOP or STOP_COPY only */ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,VFIO_DEVICE_STATE_STOP); if (ret) {@@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)static const SaveVMHandlers savevm_vfio_handlers = { .save_setup = vfio_save_setup, .save_cleanup = vfio_save_cleanup, + .state_pending_estimate = vfio_state_pending_estimate, .state_pending_exact = vfio_state_pending_exact, + .is_active_iterate = vfio_is_active_iterate, + .save_live_iterate = vfio_save_iterate, .save_live_complete_precopy = vfio_save_complete_precopy, .save_state = vfio_save_state, .load_setup = vfio_load_setup,@@ -470,13 +609,18 @@ static const SaveVMHandlers savevm_vfio_handlers = { static void vfio_vmstate_change(void *opaque, bool running, RunState state){ VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; enum vfio_device_mig_state new_state; int ret; if (running) { new_state = VFIO_DEVICE_STATE_RUNNING; } else { - new_state = VFIO_DEVICE_STATE_STOP; + new_state = + (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&+ (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?+ VFIO_DEVICE_STATE_STOP_COPY : + VFIO_DEVICE_STATE_STOP; } /* @@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice *vbasedev) migration->vbasedev = vbasedev; migration->device_state = VFIO_DEVICE_STATE_RUNNING; migration->data_fd = -1; + migration->mig_flags = mig_flags;vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index bf27a39905..72f30ce09f 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,vbasedev.pre_copy_dirty_page_tracking, ON_OFF_AUTO_ON), + DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice, + vbasedev.allow_pre_copy, true), DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice, display, ON_OFF_AUTO_OFF), DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0), diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index 646e42fd27..fd6893cb43 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events@@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"vfio_save_cleanup(const char *name) " (%s)" vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d" vfio_save_device_config_state(const char *name) " (%s)" +vfio_save_iterate(const char *name) " (%s)"vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64 -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64 +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
[Prev in Thread] | Current Thread | [Next in Thread] |