qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support


From: Avihai Horon
Subject: [PATCH 01/18] vfio/migration: Add VFIO migration pre-copy support
Date: Thu, 26 Jan 2023 20:49:31 +0200

Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.

Pre-copy support is optional in VFIO migration protocol v2.
Implement pre-copy of VFIO migration protocol v2 and use it for devices
that support it. Full description of it can be found here [1].

[1]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst |  29 ++++++---
 include/hw/vfio/vfio-common.h |   3 +
 hw/vfio/common.c              |   8 ++-
 hw/vfio/migration.c           | 112 ++++++++++++++++++++++++++++++++--
 hw/vfio/trace-events          |   5 +-
 5 files changed, 140 insertions(+), 17 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 1d50c2fe5f..51f5e1a537 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved 
state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and allows to
+accommodate VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase of migration allows for the guest to
+continue whilst the VFIO device state is transferred to the destination, this
+helps to reduce the total downtime of the VM. VFIO devices can choose to skip
+the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
+flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
 A detailed description of the UAPI for VFIO device migration can be found in
 the comment for the ``vfio_device_mig_state`` structure in the header file
@@ -29,6 +31,12 @@ VFIO implements the device hooks for the iterative approach 
as follows:
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+  active only if the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver during iterative phase.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -91,8 +99,10 @@ Flow of state changes during Live migration
 ===========================================
 
 Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
 the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
 
 Live migration save path
 ------------------------
@@ -104,11 +114,12 @@ Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                       (RUNNING, _SETUP, _RUNNING)
+                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
                                   |
-                      (RUNNING, _ACTIVE, _RUNNING)
+                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
              If device is active, get pending_bytes by .save_live_pending()
           If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  [Data of VFIO device for pre-copy phase is copied]
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5f8e7a02fe..88c2194fb9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -67,7 +67,10 @@ typedef struct VFIOMigration {
     int data_fd;
     void *data_buffer;
     size_t data_buffer_size;
+    uint64_t mig_flags;
     uint64_t stop_copy_size;
+    uint64_t precopy_init_size;
+    uint64_t precopy_dirty_size;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9a0dbee6b4..93b18c5e3d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -357,7 +357,9 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer 
*container)
 
             if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
                 (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P)) {
+                 migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
                 return false;
             }
         }
@@ -387,7 +389,9 @@ static bool 
vfio_devices_all_running_and_mig_active(VFIOContainer *container)
             }
 
             if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P) {
+                migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 760f667e04..2a0a663023 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -69,6 +69,10 @@ static const char *mig_state_to_str(enum 
vfio_device_mig_state state)
         return "RESUMING";
     case VFIO_DEVICE_STATE_RUNNING_P2P:
         return "RUNNING_P2P";
+    case VFIO_DEVICE_STATE_PRE_COPY:
+        return "PRE_COPY";
+    case VFIO_DEVICE_STATE_PRE_COPY_P2P:
+        return "PRE_COPY_P2P";
     default:
         return "UNKNOWN STATE";
     }
@@ -237,6 +241,11 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration 
*migration)
     data_size = read(migration->data_fd, migration->data_buffer,
                      migration->data_buffer_size);
     if (data_size < 0) {
+        /* Pre-copy emptied all the device state for now */
+        if (errno == ENOMSG) {
+            return 1;
+        }
+
         return -errno;
     }
     if (data_size == 0) {
@@ -260,6 +269,7 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
     uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
+    int ret;
 
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
 
@@ -273,6 +283,23 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return -ENOMEM;
     }
 
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        switch (migration->device_state) {
+        case VFIO_DEVICE_STATE_RUNNING:
+            ret = vfio_migration_set_state(vbasedev, 
VFIO_DEVICE_STATE_PRE_COPY,
+                                           VFIO_DEVICE_STATE_RUNNING);
+            if (ret) {
+                return ret;
+            }
+            break;
+        case VFIO_DEVICE_STATE_STOP:
+            /* vfio_save_complete_precopy() will go to STOP_COPY */
+            break;
+        default:
+            return -EINVAL;
+        }
+    }
+
     trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -287,6 +314,12 @@ static void vfio_save_cleanup(void *opaque)
 
     g_free(migration->data_buffer);
     migration->data_buffer = NULL;
+
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+    }
+
     vfio_migration_cleanup(vbasedev);
     trace_vfio_save_cleanup(vbasedev->name);
 }
@@ -301,9 +334,55 @@ static void vfio_save_pending(void *opaque, uint64_t 
threshold_size,
 
     *res_precopy_only += migration->stop_copy_size;
 
+    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+        migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) {
+        if (migration->precopy_init_size) {
+            /*
+             * Initial size should be transferred during pre-copy phase so
+             * stop-copy phase will not be slowed down. Report threshold_size
+             * to force another pre-copy iteration.
+             */
+            *res_precopy_only += threshold_size;
+        } else {
+            *res_precopy_only += migration->precopy_dirty_size;
+        }
+    }
+
     trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
                             *res_postcopy_only, *res_compatible,
-                            migration->stop_copy_size);
+                            migration->stop_copy_size,
+                            migration->precopy_init_size,
+                            migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+           migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    ret = vfio_save_block(f, migration);
+    if (ret < 0) {
+        return ret;
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    trace_vfio_save_iterate(vbasedev->name);
+
+    /*
+     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+     * Return 1 so following handlers will not be potentially blocked.
+     */
+    return 1;
 }
 
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -312,7 +391,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void 
*opaque)
     enum vfio_device_mig_state recover_state;
     int ret;
 
-    /* We reach here with device state STOP only */
+    /* We reach here with device state STOP or STOP_COPY only */
     recover_state = VFIO_DEVICE_STATE_STOP;
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    recover_state);
@@ -430,6 +509,8 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
     .save_live_pending = vfio_save_pending,
+    .is_active_iterate = vfio_is_active_iterate,
+    .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
@@ -442,13 +523,19 @@ static const SaveVMHandlers savevm_vfio_handlers = {
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
     if (running) {
         new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
-        new_state = VFIO_DEVICE_STATE_STOP;
+        new_state =
+            ((migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+              migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P) &&
+             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) 
?
+                VFIO_DEVICE_STATE_STOP_COPY :
+                VFIO_DEVICE_STATE_STOP;
     }
 
     ret = vfio_migration_set_state(vbasedev, new_state,
@@ -496,6 +583,9 @@ static int vfio_migration_data_notifier(NotifierWithReturn 
*n, void *data)
 {
     VFIOMigration *migration = container_of(n, VFIOMigration, migration_data);
     VFIODevice *vbasedev = migration->vbasedev;
+    struct vfio_precopy_info precopy = {
+        .argsz = sizeof(precopy),
+    };
     PrecopyNotifyData *pnd = data;
 
     if (pnd->reason != PRECOPY_NOTIFY_AFTER_BITMAP_SYNC) {
@@ -515,8 +605,21 @@ static int vfio_migration_data_notifier(NotifierWithReturn 
*n, void *data)
         migration->stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
     }
 
+    if ((migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+         migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
+        if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+            migration->precopy_init_size = 0;
+            migration->precopy_dirty_size = 0;
+        } else {
+            migration->precopy_init_size = precopy.initial_bytes;
+            migration->precopy_dirty_size = precopy.dirty_bytes;
+        }
+    }
+
     trace_vfio_migration_data_notifier(vbasedev->name,
-                                       migration->stop_copy_size);
+                                       migration->stop_copy_size,
+                                       migration->precopy_init_size,
+                                       migration->precopy_dirty_size);
 
     return 0;
 }
@@ -588,6 +691,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     migration->vbasedev = vbasedev;
     migration->device_state = VFIO_DEVICE_STATE_RUNNING;
     migration->data_fd = -1;
+    migration->mig_flags = mig_flags;
 
     oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
     if (oid) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index db9cb94952..37724579e3 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -154,7 +154,7 @@ vfio_load_cleanup(const char *name) " (%s)"
 vfio_load_device_config_state(const char *name) " (%s)"
 vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
 vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " 
(%s) size 0x%"PRIx64" ret %d"
-vfio_migration_data_notifier(const char *name, uint64_t stopcopy_size) " (%s) 
stopcopy size 0x%"PRIx64
+vfio_migration_data_notifier(const char *name, uint64_t stopcopy_size, 
uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size 
0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_migration_probe(const char *name) " (%s)"
 vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_state_notifier(const char *name, const char *state) " (%s) 
state %s"
@@ -162,6 +162,7 @@ vfio_save_block(const char *name, int data_size) " (%s) 
data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
-vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, 
uint64_t compatible, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" 
postcopy 0x%"PRIx64" compatible 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_save_iterate(const char *name) " (%s)"
+vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, 
uint64_t compatible, uint64_t stopcopy_size, uint64_t precopy_init_size, 
uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" 
compatible 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 
0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data 
buffer size 0x%"PRIx64
 vfio_vmstate_change(const char *name, int running, const char *reason, const 
char *dev_state) " (%s) running %d reason %s device state %s"
-- 
2.26.3




reply via email to

[Prev in Thread] Current Thread [Next in Thread]