From: Joao Martins
Subject: Re: [PATCH v4 12/15] vfio/common: Support device dirty page tracking with vIOMMU
Date: Fri, 8 Sep 2023 11:11:49 +0100
On 08/09/2023 07:11, Duan, Zhenzhong wrote:
> Hi Joao,
>
> On 6/23/2023 5:48 AM, Joao Martins wrote:
>> Currently, device dirty page tracking with vIOMMU is not supported:
>> a blocker is added and migration is prevented.
>>
>> When a vIOMMU is used, IOVA ranges are DMA mapped/unmapped on the fly
>> as requested by the vIOMMU. These IOVA ranges can potentially be
>> mapped anywhere in the vIOMMU IOVA space as advertised by the VMM.
>>
>> To support device dirty tracking when the vIOMMU is enabled, instead
>> create the dirty ranges based on the vIOMMU-provided limits, which
>> leads to tracking the whole IOVA space regardless of what the devices use.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> include/hw/vfio/vfio-common.h | 1 +
>> hw/vfio/common.c | 58 +++++++++++++++++++++++++++++------
>> hw/vfio/pci.c | 7 +++++
>> 3 files changed, 56 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index f41860988d6b..c4bafad084b4 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -71,6 +71,7 @@ typedef struct VFIOMigration {
>> typedef struct VFIOAddressSpace {
>> AddressSpace *as;
>> bool no_dma_translation;
>> + hwaddr max_iova;
>> QLIST_HEAD(, VFIOContainer) containers;
>> QLIST_ENTRY(VFIOAddressSpace) list;
>> } VFIOAddressSpace;
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index ecfb9afb3fb6..85fddef24026 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -428,6 +428,25 @@ static bool vfio_viommu_preset(void)
>> return false;
>> }
>> +static int vfio_viommu_get_max_iova(hwaddr *max_iova)
>> +{
>> + VFIOAddressSpace *space;
>> +
>> + *max_iova = 0;
>> +
>> + QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> + if (space->as == &address_space_memory) {
>> + continue;
>> + }
>
> Just curious why address_space_memory is bypassed?
>
The address_space_memory part is handled by the memory listener path, but I
see your point conceptually about skipping it here.
> Imagine two vfio devices linked to two host bridges, one with bypass_iommu
> set and the other not. Don't we need to include the guest memory ranges in
> address_space_memory?
I am probably making a bad assumption that the vIOMMU maximum address space
is a superset of the CPU address space. But as you just reminded me, there is
a use case where all it takes is one bypass_iommu=true on a 2T guest setup
with aw-bits=39.
I'll rework this to handle that case.
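Roughly, something like this (untested sketch; vfio_get_memory_max_iova() is
a made-up helper that would return the top of the guest physical address
space):

static int vfio_viommu_get_max_iova(hwaddr *max_iova)
{
    VFIOAddressSpace *space;

    *max_iova = 0;

    QLIST_FOREACH(space, &vfio_address_spaces, list) {
        hwaddr space_max_iova;

        if (space->as == &address_space_memory) {
            /*
             * A device behind a bypass_iommu host bridge sits directly
             * in address_space_memory, so fold in the guest memory
             * limit instead of skipping it; otherwise a 2T guest with
             * aw-bits=39 would be under-tracked.
             */
            space_max_iova = vfio_get_memory_max_iova(); /* made up */
        } else {
            space_max_iova = space->max_iova;
        }

        if (*max_iova < space_max_iova) {
            *max_iova = space_max_iova;
        }
    }

    return *max_iova == 0;
}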
>
>> +
>> + if (*max_iova < space->max_iova) {
>> + *max_iova = space->max_iova;
>> + }
>> + }
>> +
>> + return *max_iova == 0;
>> +}
>> +
>> int vfio_block_giommu_migration(Error **errp)
>> {
>> int ret;
>> @@ -1464,10 +1483,11 @@ static const MemoryListener vfio_dirty_tracking_listener = {
>> .region_add = vfio_listener_dirty_tracking_update,
>> };
>> -static void vfio_dirty_tracking_init(VFIOContainer *container,
>> +static int vfio_dirty_tracking_init(VFIOContainer *container,
>> VFIODirtyRanges *ranges)
>> {
>> VFIODirtyRangesListener dirty;
>> + int ret;
>> memset(&dirty, 0, sizeof(dirty));
>> dirty.ranges.min32 = UINT32_MAX;
>> @@ -1475,17 +1495,29 @@ static void vfio_dirty_tracking_init(VFIOContainer *container,
>> dirty.listener = vfio_dirty_tracking_listener;
>> dirty.container = container;
>> - memory_listener_register(&dirty.listener,
>> - container->space->as);
>> + if (vfio_viommu_preset()) {
>> + hwaddr iommu_max_iova;
>> +
>> + ret = vfio_viommu_get_max_iova(&iommu_max_iova);
>> + if (ret) {
>> + return -EINVAL;
>> + }
>> +
>> + vfio_dirty_tracking_update(0, iommu_max_iova, &dirty.ranges);
>> + } else {
>> + memory_listener_register(&dirty.listener,
>> + container->space->as);
>> + /*
>> + * The memory listener is synchronous, and used to calculate the range
>> + * to dirty tracking. Unregister it after we are done as we are not
>> + * interested in any follow-up updates.
>> + */
>> + memory_listener_unregister(&dirty.listener);
>> + }
>> *ranges = dirty.ranges;
>> - /*
>> - * The memory listener is synchronous, and used to calculate the range
>> - * to dirty tracking. Unregister it after we are done as we are not
>> - * interested in any follow-up updates.
>> - */
>> - memory_listener_unregister(&dirty.listener);
>> + return 0;
>> }
>> static void vfio_devices_dma_logging_stop(VFIOContainer *container)
>> @@ -1590,7 +1622,13 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
>> VFIOGroup *group;
>> int ret = 0;
>> - vfio_dirty_tracking_init(container, &ranges);
>> + ret = vfio_dirty_tracking_init(container, &ranges);
>> + if (ret) {
>> + error_report("Failed to init DMA logging ranges, err %d",
>> + ret);
>> + return -EOPNOTSUPP;
>> + }
>> +
>> feature = vfio_device_feature_dma_logging_start_create(container,
>> &ranges);
>
> Not clear how much the dirty range size could impact device dirty tracking.
>
> Maybe some devices linked to a vIOMMU with small aw_bits, or bypassing the
> vIOMMU,
So, my intended use case with this series starts with DMA_TRANSLATION=off,
where the vIOMMU is restricted to interrupt remapping, i.e. the guest uses it
only for interrupt remapping. Right now this is only supported by
intel-iommu, but my understanding of the AMD IOMMU is that it is also
possible there (haven't checked smmu-v3). This unblocks guests with more than
255 vCPUs. I have another patch, separate from this series, that hopefully
relaxes the vIOMMU migration blocker for these older guests.
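For context, the setup I'm after looks roughly like this (assuming the
intel-iommu dma-translation property; treat the exact option spelling as
illustrative):

    # >255 vCPUs needs interrupt remapping, which needs a split irqchip;
    # DMA translation stays off, so no dirty tracking range problem.
    qemu-system-x86_64 -M q35,kernel-irqchip=split -smp 288 \
        -device intel-iommu,intremap=on,dma-translation=off \
        ...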
Now, with real emulated vIOMMU DMA translation in use, intel-iommu only
offers 39-, 48- or 57-bit address space widths (mimicking the page-table
levels). And it's true that only the first value is supportable, which is
somewhat limiting as you say.
virtio-iommu has more going for it there, as you can limit the input IOVA
range to the size of the CPU address space. Eric's latest series[0] is
actually nice in that regard for fixing that issue (the alternative I had was
to introduce a gaw-bits equivalent, but it's better done like [0]).
[0] https://lore.kernel.org/qemu-devel/20230904080451.424731-1-eric.auger@redhat.com/
> with small guest memory, could benefit if we use per-address-space dirty
> ranges.
>
Yeah, I'll need to fix the case where there's big guest memory and a small
vIOMMU address space. Hopefully I picked up on your comment right :)