
Re: [PATCH v2 07/11] vfio/migration: Implement VFIO migration protocol v2


From: Avihai Horon
Subject: Re: [PATCH v2 07/11] vfio/migration: Implement VFIO migration protocol v2
Date: Wed, 27 Jul 2022 18:45:46 +0300
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0


On 7/18/2022 6:12 PM, Jason Gunthorpe wrote:
On Mon, May 30, 2022 at 08:07:35PM +0300, Avihai Horon wrote:

+/* Returns 1 if end-of-stream is reached, 0 if more data and -1 if error */
+static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+{
+    ssize_t data_size;
+
+    data_size = read(migration->data_fd, migration->data_buffer,
+                     migration->data_buffer_size);
+    if (data_size < 0) {
+        return -1;
+    }
+    if (data_size == 0) {
+        return 1;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+    qemu_put_be64(f, data_size);
+    qemu_put_buffer_async(f, migration->data_buffer, data_size, false);
+    qemu_fflush(f);
+    bytes_transferred += data_size;
+
+    trace_vfio_save_block(migration->vbasedev->name, data_size);
+
+    return qemu_file_get_error(f);
+}
We looked at this with an eye to "how much data is transferred" per
callback.

The above function is the basic data mover, and
'migration->data_buffer_size' is set to 1MB at the moment.

So, we produce up to 1MB VFIO_MIG_FLAG_DEV_DATA_STATE sections.
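For reference, each such section is framed on the wire by vfio_save_block()
above as:

    be64  VFIO_MIG_FLAG_DEV_DATA_STATE
    be64  data_size   (at most data_buffer_size, i.e. 1MB)
    data_size bytes of device state read from data_fd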

This series does not include precopy support, but that will
include a precopy 'save_live_iterate' function like this:

static int vfio_save_iterate(QEMUFile *f, void *opaque)
{
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
     int ret;

     ret = vfio_save_block(f, migration);
     if (ret < 0) {
         return ret;
     }
     if (ret == 1) {
         return 1;
     }
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
     return 0;
}

Thus, during precopy this will never do more than 1MB per callback.

+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    enum vfio_device_mig_state recover_state;
+    int ret;
+
+    /* We reach here with device state STOP or STOP_COPY only */
+    recover_state = VFIO_DEVICE_STATE_STOP;
+    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+                                   recover_state);
+    if (ret) {
+        return ret;
+    }
+
+    do {
+        ret = vfio_save_block(f, vbasedev->migration);
+        if (ret < 0) {
+            return ret;
+        }
+    } while (!ret);
This seems to be the main problem: we chain together 1MB blocks
until all of the remaining device state has been transferred. The above
is hooked to 'save_live_complete_precopy'.

So, if we want to break the above up into some 'save_iterate'-like
function, do you have any advice on how to do it? The above do/while
must happen after entering VFIO_DEVICE_STATE_STOP_COPY.
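One hypothetical shape, just to make the question concrete: split the
state transition out into a one-time step, so the loop itself can be
driven one block per callback (names invented; no such callback exists
in SaveVMHandlers today):

/* Hypothetical one-time step, run before the iterate-like callbacks:
 * enter STOP_COPY so subsequent vfio_save_block() calls drain the
 * final device state */
static int vfio_save_stop_copy_begin(void *opaque)
{
    VFIODevice *vbasedev = opaque;

    return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    VFIO_DEVICE_STATE_STOP);
}

After that, each callback would call vfio_save_block() once, exactly
like the vfio_save_iterate() sketch above, returning 1 when
end-of-stream is reached.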

Ping.

Juan, AFAIU (and correct me if I am wrong), the problem on the source side is that the save_live_complete_precopy handlers are called with the iothread lock held, so during this time QEMU is unresponsive. On the destination side, we don't yield every now and then like the RAM code does, so QEMU is unresponsive there as well.

Is it possible to solve this problem by letting the VFIO save_live_complete_precopy handler run outside the iothread lock?

For example, we could add a function to SaveVMHandlers that indicates whether a specific save_live_complete_precopy handler should run inside or outside the iothread lock. Or we could add a save_live_complete_precopy_nonblocking handler that runs outside the iothread lock.
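A rough sketch of the second option, with the new handler name being
hypothetical (SaveVMHandlers itself exists in
include/migration/register.h):

typedef struct SaveVMHandlers {
    /* ... existing handlers ... */
    int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
    /*
     * Hypothetical: like save_live_complete_precopy, but called without
     * the iothread lock held, so a long device state transfer does not
     * block the main loop.
     */
    int (*save_live_complete_precopy_nonblocking)(QEMUFile *f, void *opaque);
} SaveVMHandlers;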

On the destination side, since VFIO data is sent in chunks of 1MB, we can yield every now and then.
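A minimal sketch of what that could look like, assuming the load handler
runs in the incoming migration coroutine (vfio_load_buffer() and its
arguments are made up for illustration; qemu_in_coroutine(),
aio_co_schedule() and qemu_coroutine_yield() are existing QEMU helpers):

static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
                            uint64_t data_size)
{
    VFIOMigration *migration = vbasedev->migration;

    while (data_size) {
        size_t chunk = MIN(data_size, migration->data_buffer_size);
        size_t off = 0;

        qemu_get_buffer(f, migration->data_buffer, chunk);

        /* Push the chunk into the device, handling short writes */
        while (off < chunk) {
            ssize_t written = write(migration->data_fd,
                                    migration->data_buffer + off,
                                    chunk - off);
            if (written < 0) {
                return -errno;
            }
            off += written;
        }
        data_size -= chunk;

        /* Hypothetical yield point: reschedule ourselves so the main
         * loop gets to run between 1MB chunks */
        if (qemu_in_coroutine()) {
            aio_co_schedule(qemu_get_current_aio_context(),
                            qemu_coroutine_self());
            qemu_coroutine_yield();
        }
    }

    return 0;
}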

What do you think?

Thanks.

For mlx5 the above loop will often be ~10MB for small VMs and
~100MB for big VMs (big meaning making extensive use of RDMA
functionality), and this will not change whether or not pre-copy is
supported.

Is it still a problem?

For other devices, like a GPU, I would imagine pre-copy support will be
implemented and the post-precopy residual will be smaller.

Jason


