
Re: [PATCH v1] vfio/common: Separate vfio-pci ranges


From: Joao Martins
Subject: Re: [PATCH v1] vfio/common: Separate vfio-pci ranges
Date: Mon, 11 Sep 2023 20:17:32 +0100
User-agent: Mozilla Thunderbird

On 11/09/2023 19:35, Alex Williamson wrote:
> On Mon, 11 Sep 2023 11:12:55 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 11/09/2023 10:48, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Sent: Monday, September 11, 2023 5:07 PM
>>>> Subject: Re: [PATCH v1] vfio/common: Separate vfio-pci ranges
>>>>
>>>> On 11/09/2023 09:57, Duan, Zhenzhong wrote:  
>>>>>> -----Original Message-----
>>>>>> From: qemu-devel-bounces+zhenzhong.duan=intel.com@nongnu.org <qemu-
>>>>>> devel-bounces+zhenzhong.duan=intel.com@nongnu.org> On Behalf Of Joao
>>>>>> Martins
>>>>>> Sent: Friday, September 8, 2023 5:30 PM
>>>>>> Subject: [PATCH v1] vfio/common: Separate vfio-pci ranges
>>>>>>
>>>>>> QEMU computes the DMA logging ranges for two predefined ranges: 32-bit
>>>>>> and 64-bit. In the OVMF case, when the dynamic MMIO window is enabled,
>>>>>> QEMU includes in the 64-bit range the RAM regions at the lower part
>>>>>> and vfio-pci device RAM regions which are at the top of the address
>>>>>> space. This range contains a large gap and the size can be bigger than
>>>>>> the dirty tracking HW limits of some devices (MLX5 has a 2^42 limit).
>>>>>>
>>>>>> To avoid such large ranges, introduce a new PCI range covering the
>>>>>> vfio-pci device RAM regions, but only if the addresses are above 4GB,
>>>>>> to avoid breaking potential SeaBIOS guests.
>>>>>>
>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>> [ clg: - wrote commit log
>>>>>>       - fixed overlapping 32-bit and PCI ranges when using SeaBIOS ]
>>>>>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>>>>>> ---
>>>>>> v2:
>>>>>> * s/minpci/minpci64/
>>>>>> * s/maxpci/maxpci64/
>>>>>> * Expand comment to cover the pci-hole64 and why we don't do special
>>>>>>  handling of pci-hole32.
>>>>>> ---
>>>>>> hw/vfio/common.c     | 71 +++++++++++++++++++++++++++++++++++++-------
>>>>>> hw/vfio/trace-events |  2 +-
>>>>>> 2 files changed, 61 insertions(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>> index 237101d03844..134649226d43 100644
>>>>>> --- a/hw/vfio/common.c
>>>>>> +++ b/hw/vfio/common.c
>>>>>> @@ -27,6 +27,7 @@
>>>>>>
>>>>>> #include "hw/vfio/vfio-common.h"
>>>>>> #include "hw/vfio/vfio.h"
>>>>>> +#include "hw/vfio/pci.h"
>>>>>> #include "exec/address-spaces.h"
>>>>>> #include "exec/memory.h"
>>>>>> #include "exec/ram_addr.h"
>>>>>> @@ -1400,6 +1401,8 @@ typedef struct VFIODirtyRanges {
>>>>>>     hwaddr max32;
>>>>>>     hwaddr min64;
>>>>>>     hwaddr max64;
>>>>>> +    hwaddr minpci64;
>>>>>> +    hwaddr maxpci64;
>>>>>> } VFIODirtyRanges;
>>>>>>
>>>>>> typedef struct VFIODirtyRangesListener {
>>>>>> @@ -1408,6 +1411,31 @@ typedef struct VFIODirtyRangesListener {
>>>>>>     MemoryListener listener;
>>>>>> } VFIODirtyRangesListener;
>>>>>>
>>>>>> +static bool vfio_section_is_vfio_pci(MemoryRegionSection *section,
>>>>>> +                                     VFIOContainer *container)
>>>>>> +{
>>>>>> +    VFIOPCIDevice *pcidev;
>>>>>> +    VFIODevice *vbasedev;
>>>>>> +    VFIOGroup *group;
>>>>>> +    Object *owner;
>>>>>> +
>>>>>> +    owner = memory_region_owner(section->mr);
>>>>>> +
>>>>>> +    QLIST_FOREACH(group, &container->group_list, container_next) {
>>>>>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>>>> +            if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
>>>>>> +                continue;
>>>>>> +            }
>>>>>> +            pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>>>>> +            if (OBJECT(pcidev) == owner) {
>>>>>> +                return true;
>>>>>> +            }
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    return false;
>>>>>> +}  
>>>>>
>>>>> What about simplifying it with memory_region_is_ram_device()?
>>>>> This way vdpa devices could also be included.
>>>>>  
>>>>
>>>> Note that the check is not interested in RAM (or any other kind of memory
>>>> like VGA); that's covered by the 32-bit and 64-bit ranges. It is rather
>>>> about any PCI device RAM that would fall in the 64-bit PCI hole. Would
>>>> memory_region_is_ram_device() really cover it? If so, I am all for the
>>>> simplification.
>>>
>>> Ram device is used not only by vfio-pci BARs but also by the host
>>> notifiers of vdpa and vhost-user.
>>>
>>> I have another question: previously I thought vfio-pci BARs are device
>>> state and are saved/restored through the VFIO migration protocol, so we
>>> don't need to dirty-track them. Do I understand this wrong?
>>
>> The general thinking behind device dirty tracking is to track all
>> addressable IOVAs. But you do raise a good question. My understanding is
>> that migrating the BARs *as is* might be device-migration specific (not a
>> guarantee?); the save file and precopy interface are the only places we
>> transfer the data from/to, and it's just opaque data, not BARs or anything
>> formatted specifically -- so if we migrate BARs it is hidden in whatever the
>> device f/w wants to move. It might be that BARs aren't even needed, as they
>> are sort of scratch space from the h/w side. Ultimately, the dirty tracker
>> is the one reporting the values, and if the device h/w chooses not to
>> report those IOVAs as dirty then nothing changes.
> 
> I think ram_device is somewhat controversial, only a few select device
> types make use of it, so for example an emulated NIC not making use of
> ram_device might still exist within the IOVA aperture of the device.
> 
> It does however seem strange to me to include the MMIO space of other
> devices in the dirty tracking of the vfio device.  I'm not sure it's
> really the vfio device's place to report MMIO within another device as
> dirty.  Saving grace being that this likely doesn't really occur
> currently, but should the vfio device even exert any resources for
> tracking such ranges?
> 
I suppose we could be strict and track only RAM plus the device's own PCI
ranges (excluding others like VGA, etc.); it makes conceptual sense. And when
mixing in devices with big RAM ranges (*if any*), it may stress the dirty
tracker more than it should. Something like the sketch below is what I have
in mind.
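
For illustration only (not part of the posted patch): a stricter check could
reuse the ownership walk from the patch and additionally require a ram_device
section, which is what the memory_region_is_ram_device() suggestion boils down
to. The helper name is made up here:

static bool vfio_section_is_own_pci_bar(MemoryRegionSection *section,
                                        VFIOContainer *container)
{
    /*
     * Only mmap'able device RAM (BARs) is of interest; this skips VGA,
     * ROMs and regular RAM, but it also matches vdpa/vhost-user host
     * notifiers, hence the ownership walk below.
     */
    if (!memory_region_is_ram_device(section->mr)) {
        return false;
    }

    /* Keep only regions owned by this container's own vfio-pci devices. */
    return vfio_section_is_vfio_pci(section, container);
}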

> A concern with simply trying to define RAM vs PCI ranges is that QEMU
> is pretty fluid about the VM layout.  For example, what prevents us
> from having a NUMA configuration where one memory node might be
> similarly disjoint as between RAM and PCI memory here?  TBH, I thought
> this was why we had the combine_iova_ranges() function in the kernel so
> that QEMU simply wrote through the dirty tracking ranges and the driver
> combined them as necessary based on their own constraints.

And I think it was ... as userspace can still pass a fairly random set of
ranges, so the h/w driver may need to do some filtering. The initial versions
took that approach, i.e. store everything and let the kernel select what it
wants to pass to hardware. But we ended up with this dual/three-range tracking
to simplify the logic, as I recall. Perhaps how we started (via IOVATree) was
the wrong data structure on my end, and even switching to a simpler data
structure, there will always be the dynamic behaviour of needing to
re-allocate the ranges as we go. For reference, the per-section fold ends up
looking roughly like the sketch below.
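
The corresponding vfio_dirty_tracking_update() hunk isn't quoted above, so
take this as an illustrative sketch rather than the exact patch contents:

static void vfio_dirty_tracking_update(MemoryListener *listener,
                                       MemoryRegionSection *section)
{
    VFIODirtyRangesListener *dirty =
        container_of(listener, VFIODirtyRangesListener, listener);
    VFIODirtyRanges *range = &dirty->ranges;
    hwaddr iova = section->offset_within_address_space;
    hwaddr end = iova + int128_get64(section->size) - 1;
    hwaddr *min, *max;

    /*
     * vfio-pci device RAM above 4G goes into its own bucket so that the
     * 64-bit RAM range doesn't stretch across the PCI hole and exceed the
     * dirty tracking HW limits (e.g. MLX5's 2^42).
     */
    if (vfio_section_is_vfio_pci(section, dirty->container) &&
        iova >= UINT32_MAX) {
        min = &range->minpci64;
        max = &range->maxpci64;
    } else {
        min = (end <= UINT32_MAX) ? &range->min32 : &range->min64;
        max = (end <= UINT32_MAX) ? &range->max32 : &range->max64;
    }

    *min = MIN(*min, iova);
    *max = MAX(*max, end);
}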

        Joao


