qemu-s390x
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v


From: Avihai Horon
Subject: Re: [PATCH v3 12/17] vfio/migration: Implement VFIO migration protocol v2
Date: Sun, 20 Nov 2022 10:46:13 +0200
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.4.1


On 17/11/2022 19:24, Jason Gunthorpe wrote:
On Thu, Nov 17, 2022 at 07:07:10PM +0200, Avihai Horon wrote:
+    }
+
+    if (mig_state->data_fd != -1) {
+        if (migration->data_fd != -1) {
+            /*
+             * This can happen if the device is asynchronously reset and
+             * terminates a data transfer.
+             */
+            error_report("%s: data_fd out of sync", vbasedev->name);
+            close(mig_state->data_fd);
+
+            return -1;
Should we go to recover_state here?  Is migration->device_state
invalid?  -EBADF?
Yes, we should.
Although VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE ioctl above succeeded, setting
the device state didn't *really* succeed, as the data_fd went out of sync.
So we should go to recover_state and return -EBADF.
The state did succeed and it is now "new_state". Getting an
unexpected data_fd means it did something like RUNNING->PRE_COPY_P2P
when the code was expecting PRE_COPY->PRE_COPY_P2P.

It is actually in PRE_COPY_P2P but the in-progress migration must be
stopped and the kernel would have made the migration->data_fd
permanently return some error when it went async to RUNNING.

The recovery is to resart the migration (of this device?) from the
start.

Yes, and restart is what's happening here - the -EBADF that we return here will cause the migration to be aborted. I didn't mean that we should go to recover_state *instead* of returning an error.

But rethinking about it, I think you are right - recover_state should be used only if we can't set the device in new_state (i.e., there is some error in device functionality). In the "data_fd out of sync" case, we did set the device in new_state (no error in device functionality), but data_fd got mixed up, so we should just abort migration and restart it again.

So bottom line, I think we should just return -EBADF here to abort migration.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]