Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device
From: Kirti Wankhede
Subject: Re: [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state
Date: Thu, 14 Nov 2019 00:32:55 +0530
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Thunderbird/68.1.2
On 11/13/2019 8:53 AM, Yan Zhao wrote:
On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote:
On Tue, 12 Nov 2019 22:33:36 +0530
Kirti Wankhede <address@hidden> wrote:
- Defined MIGRATION region type and sub-type.
- Used 3 bits to define VFIO device states.
Bit 0 => _RUNNING
Bit 1 => _SAVING
Bit 2 => _RESUMING
Combination of these bits defines VFIO device's state during migration
_RUNNING => Normal VFIO device running state. When its reset, it
indicates _STOPPED state. when device is changed to
_STOPPED, driver should stop device before write()
returns.
_SAVING | _RUNNING => vCPUs are running, VFIO device is running but
start saving state of device i.e. pre-copy state
_SAVING => vCPUs are stopped, VFIO device should be stopped, and
s/should/must/
save device state, i.e. stop-n-copy state
_RESUMING => VFIO device resuming state.
_SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
A table might be useful here and in the uapi header to indicate valid
states:
| _RESUMING | _SAVING | _RUNNING | Description
+-----------+---------+----------+------------------------------------------
| 0 | 0 | 0 | Stopped, not saving or resuming (a)
+-----------+---------+----------+------------------------------------------
| 0 | 0 | 1 | Running, default state
+-----------+---------+----------+------------------------------------------
| 0 | 1 | 0 | Stopped, migration interface in save mode
+-----------+---------+----------+------------------------------------------
| 0 | 1 | 1 | Running, save mode interface, iterative
+-----------+---------+----------+------------------------------------------
| 1 | 0 | 0 | Stopped, migration resume interface active
+-----------+---------+----------+------------------------------------------
| 1 | 0 | 1 | Invalid (b)
+-----------+---------+----------+------------------------------------------
| 1 | 1 | 0 | Invalid (c)
+-----------+---------+----------+------------------------------------------
| 1 | 1 | 1 | Invalid (d)
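The table above can be turned into a small lookup, shown here as a minimal
sketch rather than as anything proposed in the patch. The bit values are
copied from the patch (written as unsigned constants); the helper names
vfio_state_describe() and vfio_state_is_valid() are hypothetical, and the
sketch assumes reserved bits 3 - 31 are simply ignored when classifying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as defined in the patch under review. */
#define VFIO_DEVICE_STATE_RUNNING  (1u << 0)
#define VFIO_DEVICE_STATE_SAVING   (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING (1u << 2)
#define VFIO_DEVICE_STATE_MASK     (VFIO_DEVICE_STATE_RUNNING | \
                                    VFIO_DEVICE_STATE_SAVING | \
                                    VFIO_DEVICE_STATE_RESUMING)

/*
 * Hypothetical helper: map the three state bits to the description from
 * the table, or NULL for the combinations marked invalid (b), (c), (d).
 * Reserved bits are masked off first (an assumption of this sketch).
 */
static const char *vfio_state_describe(uint32_t device_state)
{
    switch (device_state & VFIO_DEVICE_STATE_MASK) {
    case 0:
        return "stopped, not saving or resuming";
    case VFIO_DEVICE_STATE_RUNNING:
        return "running, default state";
    case VFIO_DEVICE_STATE_SAVING:
        return "stopped, save mode (stop-and-copy)";
    case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
        return "running, save mode (pre-copy)";
    case VFIO_DEVICE_STATE_RESUMING:
        return "stopped, resume interface active";
    default:
        return NULL; /* any other combination with _RESUMING set */
    }
}

/* Hypothetical helper: true for the five valid rows of the table. */
static bool vfio_state_is_valid(uint32_t device_state)
{
    return vfio_state_describe(device_state) != NULL;
}
```

A whitelist of this kind keeps accepting only the five known rows even if
more bits are defined later, at the cost of needing an update whenever a new
valid combination is introduced.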
I think we need to consider whether we define (a) as generally
available, for instance we might want to use it for diagnostics or a
fatal error condition outside of migration.
We have to set it as the initial state. I'll add it.
Are there hidden assumptions between state transitions here or are
there specific next possible state diagrams that we need to include as
well?
I'm curious if Intel agrees with the states marked invalid with their
push for post-copy support.
Hi Alex and Kirti,
Actually, for postcopy, I think we need an extra POSTCOPY state anyway,
for the reasons below:
- On the target side, the _RESUMING state is set at the beginning of
migration. It cannot also be used to inform the device that it is
currently in postcopy state and that device DMAs are to be trapped and
pre-faulted.
We also cannot use state (_RESUMING + _RUNNING) as an indicator of
postcopy, because before the device and VM run on the target side, some
device state is already loaded (e.g. page tables, pending workloads), and
the target side can do pre-pagefault in that period, before the target VM
is up.
- On the source side, after the device is stopped, postcopy needs to save
device state only (as compared to device state + remaining dirty
pages in precopy). State (!_RUNNING + _SAVING) again cannot
differentiate precopy from postcopy.
Bits 3 - 31 are reserved for future use. User should perform
read-modify-write operation on this field.
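As an illustration of the read-modify-write rule, here is a minimal sketch
of the "modify" step only. vfio_state_compose() is a hypothetical helper,
not part of the patch; the surrounding read and write of device_state
(4-byte accesses at offset 0 of the migration region) are assumed and not
shown.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as defined in the patch under review. */
#define VFIO_DEVICE_STATE_RUNNING  (1u << 0)
#define VFIO_DEVICE_STATE_SAVING   (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING (1u << 2)
#define VFIO_DEVICE_STATE_MASK     (VFIO_DEVICE_STATE_RUNNING | \
                                    VFIO_DEVICE_STATE_SAVING | \
                                    VFIO_DEVICE_STATE_RESUMING)

/*
 * Hypothetical helper: given the value just read from device_state and
 * the wanted combination of the three state bits, return the value to
 * write back. Reserved bits 3 - 31 are carried over unchanged.
 */
static uint32_t vfio_state_compose(uint32_t current, uint32_t wanted)
{
    return (current & ~VFIO_DEVICE_STATE_MASK) |
           (wanted & VFIO_DEVICE_STATE_MASK);
}
```

For example, moving from _RUNNING to _SAVING with some reserved bits set
would write back the same reserved bits with only bits 0 - 2 replaced.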
- Defined vfio_device_migration_info structure which will be placed at 0th
offset of migration region to get/set VFIO device related information.
Defined members of structure and usage on read/write access:
* device_state: (read/write)
To convey VFIO device state to be transitioned to. Only 3 bits are
used as of now, Bits 3 - 31 are reserved for future use.
* pending_bytes: (read only)
To get pending bytes yet to be migrated for VFIO device.
* data_offset: (read only)
To get data offset in migration region from where data exist
during _SAVING and from where data should be written by user space
application during _RESUMING state.
* data_size: (read/write)
To get and set size in bytes of data copied in migration region
during _SAVING and _RESUMING state.
Migration region looks like:
------------------------------------------------------------------
|vfio_device_migration_info| data section |
| | /////////////////////////////// |
------------------------------------------------------------------
^ ^
offset 0-trapped part data_offset
Structure vfio_device_migration_info is always followed by data section
in the region, so data_offset will always be non-0. Offset from where data
to be copied is decided by kernel driver, data section can be trapped or
mapped depending on how kernel driver defines data section.
Data section partition can be defined as mapped by sparse mmap capability.
If mmapped, then data_offset should be page aligned, where as initial
section which contain vfio_device_migration_info structure might not end
at offset which is page aligned.
Vendor driver should decide whether to partition data section and how to
partition the data section. Vendor driver should return data_offset
accordingly.
For user application, data is opaque. User should write data in the same
order as received.
Signed-off-by: Kirti Wankhede <address@hidden>
Reviewed-by: Neo Jia <address@hidden>
---
include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 108 insertions(+)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..35b09427ad9f 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
#define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff)
#define VFIO_REGION_TYPE_GFX (1)
#define VFIO_REGION_TYPE_CCW (2)
+#define VFIO_REGION_TYPE_MIGRATION (3)
/* sub-types for VFIO_REGION_TYPE_PCI_* */
@@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
/* sub-types for VFIO_REGION_TYPE_CCW */
#define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD (1)
+/* sub-types for VFIO_REGION_TYPE_MIGRATION */
+#define VFIO_REGION_SUBTYPE_MIGRATION (1)
+
+/*
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ * To indicate vendor driver the state VFIO device should be transitioned
+ * to. If device state transition fails, write on this field return error.
+ * It consists of 3 bits:
+ * - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
ambiguous.
Ok. Updating it.
+ * _STOPPED state. When device is changed to _STOPPED, driver should stop
+ * device before write() returns.
+ * - If bit 1 set, indicates _SAVING state. When set, that indicates driver
+ * should start gathering device state information which will be provided
+ * to VFIO user space application to save device's state.
+ * - If bit 2 set, indicates _RESUMING state. When set, that indicates
+ * prepare to resume device, data provided through migration region
+ * should be used to resume device.
+ * Bits 3 - 31 are reserved for future use. User should perform
+ * read-modify-write operation on this field.
+ * _SAVING and _RESUMING bits set at the same time is invalid state.
+ * Similarly _RUNNING and _RESUMING bits set is invalid state.
+ *
+ * pending_bytes: (read only)
+ * Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ * User application should read data_offset in migration region from where
+ * user application should read device data during _SAVING state or write
+ * device data during _RESUMING state. See below for detail of sequence to
+ * be followed.
+ *
+ * data_size: (read/write)
+ * User application should read data_size to get size of data copied in
+ * bytes in migration region during _SAVING state and write size of data
+ * copied in bytes in migration region during _RESUMING state.
+ *
+ * Migration region looks like:
+ * ------------------------------------------------------------------
+ * |vfio_device_migration_info| data section |
+ * | | /////////////////////////////// |
+ * ------------------------------------------------------------------
+ * ^ ^
+ * offset 0-trapped part data_offset
+ *
+ * Structure vfio_device_migration_info is always followed by data section in
+ * the region, so data_offset will always be non-0. Offset from where data is
+ * copied is decided by kernel driver, data section can be trapped or mapped
+ * or partitioned, depending on how kernel driver defines data section.
+ * Data section partition can be defined as mapped by sparse mmap capability.
+ * If mmapped, then data_offset should be page aligned, where as initial section
+ * which contain vfio_device_migration_info structure might not end at offset
+ * which is page aligned.
"The user is not required to access via mmap regardless of the
region mmap capabilities."
But once the user decides to access via mmap, it has to read data_size
bytes of data each time, otherwise the vendor driver has no idea how much
data the user has already read. Agree?
that's right.
+ * Vendor driver should decide whether to partition data section and how to
+ * partition the data section. Vendor driver should return data_offset
+ * accordingly.
+ *
+ * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
+ * and for _SAVING device state or stop-and-copy phase:
+ * a. read pending_bytes. If pending_bytes > 0, go through below steps.
+ * b. read data_offset, indicates kernel driver to write data to staging buffer.
+ * Kernel driver should return this read operation only after writing data to
+ * staging buffer is done.
May I know under what conditions data_offset will change between
iterations of steps a-f?
It's up to the vendor driver, if the vendor driver maintains multiple
partitions in the data section.
"staging buffer" implies a vendor driver implementation, perhaps we
could just state that data is available from (region + data_offset) to
(region + data_offset + data_size) upon return of this read operation.
Makes sense. Updating it.
+ * c. read data_size, amount of data in bytes written by vendor driver in
+ * migration region.
+ * d. read data_size bytes of data from data_offset in the migration region.
+ * e. process data.
+ * f. Loop through a to e. Next read on pending_bytes indicates that read data
+ * operation from migration region for previous iteration is done.
I think this indicates that step (f) should be to read pending_bytes, the
read sequence is not complete until this step. Optionally the user can
then proceed to step (b). There are no read side-effects of (a) afaict.
I tried to reword this sequence to be more specific:
* Sequence to be followed for _SAVING|_RUNNING device state or pre-copy
* phase and for _SAVING device state or stop-and-copy phase:
* a. read pending_bytes, indicates start of new iteration to get device
* data. If there was previous iteration, then this read operation
* indicates previous iteration is done. If pending_bytes > 0, go
* through below steps.
* b. read data_offset, indicates kernel driver to make data available
* through data section. Kernel driver should return this read
* operation only after data is available from (region + data_offset)
* to (region + data_offset + data_size).
* c. read data_size, amount of data in bytes available through migration
* region.
* d. read data of data_size bytes from (region + data_offset) from
* migration region.
* e. process data.
* f. Loop through a to e.
Hope this is more clear.
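For reference, the reworded sequence can be sketched from the user side as
follows. This is only an illustration under stated assumptions: region
accesses go through a hypothetical pread()-like callback (region_read_fn)
rather than a real VFIO device fd, "process data" is reduced to appending
to an output buffer, and fake_read() is a toy in-memory stand-in for a
vendor driver that stages FAKE_CHUNK bytes per iteration. The struct layout
is copied from the patch with <stdint.h> types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout copied from the patch, with <stdint.h> types for __u32/__u64. */
struct vfio_device_migration_info {
    uint32_t device_state;
    uint32_t reserved;
    uint64_t pending_bytes;
    uint64_t data_offset;
    uint64_t data_size;
} __attribute__((packed));

#define MIG_OFF(field) offsetof(struct vfio_device_migration_info, field)

/* Hypothetical pread()-like accessor on the region; returns 0 on success. */
typedef int (*region_read_fn)(void *ctx, uint64_t off, void *buf, uint64_t len);

/*
 * Run steps a-f until pending_bytes reads 0, appending the data to out.
 * Returns the number of bytes collected, or -1 on error.
 */
static int64_t mig_save_loop(region_read_fn rd, void *ctx,
                             uint8_t *out, uint64_t out_len)
{
    uint64_t pending, data_offset, data_size, total = 0;

    for (;;) {
        /* a. Starts a new iteration and completes the previous one. */
        if (rd(ctx, MIG_OFF(pending_bytes), &pending, sizeof(pending)))
            return -1;
        if (pending == 0)
            return (int64_t)total;
        /* b. Returns only once data is available in the data section. */
        if (rd(ctx, MIG_OFF(data_offset), &data_offset, sizeof(data_offset)))
            return -1;
        /* c. Amount of data available for this iteration. */
        if (rd(ctx, MIG_OFF(data_size), &data_size, sizeof(data_size)))
            return -1;
        if (data_size > out_len - total)
            return -1;
        /* d, e. Read the data and "process" it by appending to out. */
        if (rd(ctx, data_offset, out + total, data_size))
            return -1;
        total += data_size;
        /* f. Loop back to a. */
    }
}

/* Toy in-memory stand-in for a vendor driver, for illustration only. */
#define FAKE_DATA_OFFSET 4096u
#define FAKE_CHUNK       4u

struct fake_dev {
    const uint8_t *src;          /* device state to hand out */
    uint64_t len, done, staged;  /* total, completed, currently staged */
};

static int fake_read(void *ctx, uint64_t off, void *buf, uint64_t len)
{
    struct fake_dev *d = ctx;
    uint64_t v;

    if (off == FAKE_DATA_OFFSET) {          /* data section access */
        if (len > d->staged)
            return -1;
        memcpy(buf, d->src + d->done, len);
        return 0;
    }
    if (len != sizeof(v))                   /* fields: native width only */
        return -1;
    if (off == MIG_OFF(pending_bytes)) {    /* previous iteration is done */
        d->done += d->staged;
        d->staged = 0;
        v = d->len - d->done;
    } else if (off == MIG_OFF(data_offset)) { /* stage the next chunk */
        uint64_t left = d->len - d->done;
        d->staged = left < FAKE_CHUNK ? left : FAKE_CHUNK;
        v = FAKE_DATA_OFFSET;
    } else if (off == MIG_OFF(data_size)) {
        v = d->staged;
    } else {
        return -1;
    }
    memcpy(buf, &v, sizeof(v));
    return 0;
}
```

The fake driver treats the read of pending_bytes as the only point where an
iteration is considered complete, which matches the reworded step a. A real
user in the pre-copy phase could of course leave the loop early instead of
waiting for pending_bytes to reach 0.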
Is the user required to reach pending_bytes == 0 before changing
device_state, particularly transitioning to !_RUNNING?
No, it's not necessary to reach pending_bytes == 0 in the pre-copy phase.
Presumably the
user can exit this sequence at any time by clearing _SAVING.
In that case the device state data is incomplete, so the device cannot be
resumed with that data.
In the stop-and-copy phase, the user should iterate until pending_bytes is 0.
+ *
+ * Sequence to be followed while _RESUMING device state:
+ * While data for this device is available, repeat below steps:
+ * a. read data_offset from where user application should write data.
Before proceeding to step b, the user needs to read data_size from the vendor
driver to determine the maximum length of data to write. I think this is
necessary when the source and target vendor drivers do not offer data sections
of the same size, e.g. on the source side the data section is 100M,
but on the target side the data section is only 50M.
Rather than simply failing, looping and writing seems better.
Makes sense. Doing this change for next version.
Thanks
Yan
+ * b. write data of data_size to migration region from data_offset.
+ * c. write data_size which indicates vendor driver that data is written in
+ * staging buffer. Vendor driver should read this data from migration
+ * region and resume device's state.
The device defaults to _RUNNING state, so a prerequisite is to set
_RESUMING and clear _RUNNING, right?
Yes.
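A rough user-side sketch of the _RESUMING sequence (steps a-c above),
including the "loop and write" idea for a smaller target data section. It
assumes device_state has already been set to _RESUMING with _RUNNING
cleared. The accessors region_read_fn/region_write_fn, the chunk_max
parameter (how the user learns the limit is what is being discussed here)
and the fake_target stand-in for a vendor driver are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout copied from the patch, with <stdint.h> types for __u32/__u64. */
struct vfio_device_migration_info {
    uint32_t device_state;
    uint32_t reserved;
    uint64_t pending_bytes;
    uint64_t data_offset;
    uint64_t data_size;
} __attribute__((packed));

#define MIG_OFF(field) offsetof(struct vfio_device_migration_info, field)

/* Hypothetical pread()/pwrite()-like accessors; return 0 on success. */
typedef int (*region_read_fn)(void *ctx, uint64_t off, void *buf, uint64_t len);
typedef int (*region_write_fn)(void *ctx, uint64_t off, const void *buf,
                               uint64_t len);

/* Feed len bytes of saved data to the device, at most chunk_max at a time. */
static int mig_resume_loop(region_read_fn rd, region_write_fn wr, void *ctx,
                           const uint8_t *data, uint64_t len,
                           uint64_t chunk_max)
{
    uint64_t done = 0;

    if (chunk_max == 0)
        return -1;
    while (done < len) {
        uint64_t data_offset, n = len - done;

        if (n > chunk_max)
            n = chunk_max;
        /* a. Ask where in the region the data should be written. */
        if (rd(ctx, MIG_OFF(data_offset), &data_offset, sizeof(data_offset)))
            return -1;
        /* b. Write the next chunk, in the order it was received. */
        if (wr(ctx, data_offset, data + done, n))
            return -1;
        /* c. Writing data_size tells the driver the chunk is in place. */
        if (wr(ctx, MIG_OFF(data_size), &n, sizeof(n)))
            return -1;
        done += n;
    }
    return 0;
}

/* Toy in-memory target with a 4-byte data section, for illustration only. */
#define FAKE_DATA_OFFSET 4096u

struct fake_target {
    uint8_t staging[4];
    uint8_t restored[32];
    uint64_t restored_len;
};

static int fake_rd(void *ctx, uint64_t off, void *buf, uint64_t len)
{
    uint64_t v = FAKE_DATA_OFFSET;

    (void)ctx;
    if (off != MIG_OFF(data_offset) || len != sizeof(v))
        return -1;
    memcpy(buf, &v, sizeof(v));
    return 0;
}

static int fake_wr(void *ctx, uint64_t off, const void *buf, uint64_t len)
{
    struct fake_target *t = ctx;
    uint64_t n;

    if (off == FAKE_DATA_OFFSET) {          /* data section write */
        if (len > sizeof(t->staging))
            return -1;
        memcpy(t->staging, buf, len);
        return 0;
    }
    if (off != MIG_OFF(data_size) || len != sizeof(n))
        return -1;
    memcpy(&n, buf, sizeof(n));             /* consume the staged chunk */
    if (n > sizeof(t->staging) || n > sizeof(t->restored) - t->restored_len)
        return -1;
    memcpy(t->restored + t->restored_len, t->staging, n);
    t->restored_len += n;
    return 0;
}
```

With a chunk limit no larger than the target's data section, the whole
stream is applied in order; with a larger limit the toy target rejects the
write, which is the failure case the looping approach is meant to avoid.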
+ *
+ * For user application, data is opaque. User should write data in the same
+ * order as received.
+ */
+
+struct vfio_device_migration_info {
+ __u32 device_state; /* VFIO device state */
+#define VFIO_DEVICE_STATE_RUNNING (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING (1 << 2)
+#define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \
+ VFIO_DEVICE_STATE_SAVING | \
+ VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_INVALID_CASE1 (VFIO_DEVICE_STATE_SAVING | \
+ VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_INVALID_CASE2 (VFIO_DEVICE_STATE_RUNNING | \
+ VFIO_DEVICE_STATE_RESUMING)
These seem difficult to use, maybe we just need a
VFIO_DEVICE_STATE_VALID macro?
#define VFIO_DEVICE_STATE_VALID(state) \
(state & VFIO_DEVICE_STATE_RESUMING ? \
(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
This will not work once other bits come into use in the future.
That's the reason I preferred to add individual invalid states, which the
user should check.
Thanks,
Kirti
Thanks,
Alex
+ __u32 reserved;
+ __u64 pending_bytes;
+ __u64 data_offset;
+ __u64 data_size;
+} __attribute__((packed));
+
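As a side note on the layout, a small sketch of what the structure above
implies for user space. The struct is copied from the patch with <stdint.h>
types; the compile-time checks only restate the field sizes. The helper
mig_data_offset_ok() is hypothetical, and its first check (data section
starts past the header) is an inference from "data_offset will always be
non-0", not something the patch states in those words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Layout copied from the patch, with <stdint.h> types for __u32/__u64. */
struct vfio_device_migration_info {
    uint32_t device_state;
    uint32_t reserved;
    uint64_t pending_bytes;
    uint64_t data_offset;
    uint64_t data_size;
} __attribute__((packed));

/* Field offsets within the migration region, checked at compile time. */
_Static_assert(offsetof(struct vfio_device_migration_info, device_state) == 0,
               "device_state at offset 0");
_Static_assert(offsetof(struct vfio_device_migration_info, pending_bytes) == 8,
               "pending_bytes at offset 8");
_Static_assert(offsetof(struct vfio_device_migration_info, data_offset) == 16,
               "data_offset at offset 16");
_Static_assert(offsetof(struct vfio_device_migration_info, data_size) == 24,
               "data_size at offset 24");
_Static_assert(sizeof(struct vfio_device_migration_info) == 32,
               "header is 32 bytes");

/*
 * Hypothetical sanity check on a data_offset value returned by a driver:
 * the data section must start after the header, and must be page aligned
 * if the user intends to mmap it.
 */
static bool mig_data_offset_ok(uint64_t data_offset, uint64_t page_size,
                               bool mmapped)
{
    if (data_offset < sizeof(struct vfio_device_migration_info))
        return false;
    if (mmapped && (page_size == 0 || data_offset % page_size != 0))
        return false;
    return true;
}
```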
/*
* The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
* which allows direct access to non-MSIX registers which happened to be within