Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space


From: Alex Williamson
Subject: Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Date: Wed, 24 Jul 2019 08:43:55 -0600

On Wed, 24 Jul 2019 18:03:31 +0800
Peter Xu <address@hidden> wrote:

> On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:  
> > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:  
> > > > > On 3/29/19 11:49 AM, Alex Williamson wrote:  
> > > > > > [Cc +Brijesh]
> > > > > > 
> > > > > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > > > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > > > > think the Linux code handles this regardless of the firmware 
> > > > > > provided
> > > > > > aliases, but is it required per spec for the ACPI tables to include
> > > > > > bridge aliases?  Thanks,
> > > > > >     
> > > > > 
> > > > > We do need to include aliases in the ACPI table.  We need to
> > > > > populate IVHD type 0x43 and type 0x04 entries for the alias range
> > > > > start and end.  I believe the host IVRS would contain similar
> > > > > information.
> > > > > 
> > > > > Suravee, please correct me if I am missing something?  
> > > > 
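[For reference, my reading of the IVHD device entry formats in the AMD
IOMMU spec gives roughly the layouts below.  This is only a sketch; the
field offsets are from memory and unverified:

    #include <stdint.h>

    /* 4-byte IVHD device entry; type 0x04 is "end of range" */
    struct ivhd_entry4 {
        uint8_t  type;          /* 0x04 */
        uint16_t devid;         /* BDF of the last device in the range */
        uint8_t  setting;       /* DTE setting bits */
    } __attribute__((packed));

    /* 8-byte IVHD device entry; type 0x43 is "alias start of range" */
    struct ivhd_entry8_alias {
        uint8_t  type;          /* 0x43 */
        uint16_t devid;         /* BDF of the first device in the range */
        uint8_t  setting;       /* DTE setting bits */
        uint8_t  reserved1;
        uint16_t alias_devid;   /* BDF that requests in the range alias to */
        uint8_t  reserved2;
    } __attribute__((packed));
]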
> > > > I finally found some time to investigate this a little further, yes the
> > > > types mentioned are correct for defining start and end of an alias
> > > > range.  The challenge here is that these entries require a DeviceID,
> > > > which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> > > > numbers are defined by the guest firmware, and potentially redefined by
> > > > the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> > > > IVRS to describe alias ranges.  I'm wondering if the solution here is
> > > > to define a new linker-loader command that would instruct the guest to
> > > > write a bus number byte to a given offset for a described device.
> > > > These commands would be inserted before the checksum command, such that
> > > > these bus number updates are calculated as part of the checksum.
> > > > 
> > > > I'm imagining the command format would need to be able to distinguish
> > > > between the actual bus number of a described device, the secondary bus
> > > > number of the device, and the subordinate bus number of the device.
> > > > For describing the device, I'm envisioning stealing from the DMAR
> > > > definition, which already includes a bus-number-invariant mechanism to
> > > > describe a device: start with a segment and root bus, then follow a
> > > > chain of devfns to the target device.  The guest firmware
> > > > would follow the path to the described device, pick the desired bus
> > > > number, and write it to the indicated table offset.
> > > > 
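[Roughly the sort of thing I have in mind, sketched as a hypothetical
sibling of the entries in hw/acpi/bios-linker-loader.c.  The command
name, numbering, and layout below are entirely made up:

    #include <stdint.h>

    #define LL_FILESZ 56    /* matches BIOS_LINKER_LOADER_FILESZ */

    /* Which bus number of the resolved device to write */
    enum {
        LL_WRITE_BUS_NR_PRIMARY     = 0,    /* bus the device sits on */
        LL_WRITE_BUS_NR_SECONDARY   = 1,    /* bridge secondary bus */
        LL_WRITE_BUS_NR_SUBORDINATE = 2,    /* bridge subordinate bus */
    };

    /* Hypothetical WRITE_BUS_NUMBER linker-loader command payload */
    struct ll_write_bus_nr {
        char     file[LL_FILESZ];   /* fw_cfg blob to patch (the IVRS) */
        uint32_t offset;            /* where to store the bus number byte */
        uint8_t  which;             /* LL_WRITE_BUS_NR_* above */
        uint16_t segment;           /* PCI segment of the root bus */
        uint8_t  path_len;          /* number of devfn hops in the path */
        uint8_t  devfn_path[8];     /* DMAR-style devfn chain from root bus */
    } __attribute__((packed));

Firmware would walk devfn_path from the segment's root bus to resolve
the device, pick the requested bus number, and store it at file+offset,
ahead of the ADD_CHECKSUM command so the patch lands in the checksum.]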
> > > > Does this seem like a reasonable approach?  Better ideas?  I'm not
> > > > thrilled with the increased scope demanded by IVRS support, but so long
> > > > as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,  
> > > 
> > > I don't have a better idea yet, but I just want to say that,
> > > coincidentally, I started looking into this as well this week, and
> > > it's mostly what I had thought of too (I was still reading a bit of
> > > seabios when I saw this email)... so at least this idea makes sense
> > > to me.
> > > 
> > > Would the guest OS still change the PCI bus numbers even after the
> > > firmware (BIOS/UEFI) has assigned them?  Could I ask in what cases
> > > that would happen?
> > > 
> > > Thanks,  
> > 
> > Guest OSes can in theory rebalance resources. Changing bus numbers
> > would be useful if new bridges are added by hotplug.
> > In practice at least Linux doesn't do the rebalancing.
> > I think that if we start reporting PNP OS support in the BIOS then
> > Windows might start doing that more aggressively.
> 
> That surprises me a bit...  IMHO, if we allow bus numbers to change,
> then many scripts that previously worked could fail.  E.g., a very
> common pattern is to run an "lspci-like" program to list the devices
> and then run "lspci-like -vvv" on a BDF fetched from the previous
> command.  Any kind of BDF caching, whether in userspace or in the
> kernel, would be invalidated.
> 
> Also, the data stored in the IVRS is obviously closely bound to how
> bus numbers are assigned.  Even if we can add a new linker-loader
> command to the open firmwares like seabios or OVMF, we still can't do
> that for Windows (or could we?...).
> 
> Now, one step back: I'm also curious about the reason why the AMD
> spec requires BDF information in the IVRS rather than scope
> information like what the Intel DMAR spec asks for.

It's a deficiency of the IVRS spec, but it's really out of scope here.
It's not the responsibility of the hypervisor to resolve this sort of
design issue; we should simply maintain the bare metal behavior and the
bare metal limitations of the design.

Michael did raise some interesting ideas regarding QEMU updating the
IVRS table, though.  QEMU does know when bus apertures are programmed
on devices, and the config writes for these updates could trigger IVRS
updates.  I think we'd want to allow such updates only between machine
reset and the guest firmware writing the table checksum.  That reduces
the scope of the necessary changes, though it still feels a little
messy to have config writes triggering table updates.
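To make that concrete, here's a rough sketch of the sort of hook I
mean; acpi_tables_sealed() and ivrs_update_alias_ranges() are invented
names, not existing QEMU interfaces:

    /* Hypothetical bridge config-write hook: re-patch the IVRS blob
     * whenever the guest programs a bridge's bus number registers,
     * but only until the firmware writes the table checksum.
     */
    static void ivrs_bridge_write_config(PCIDevice *d, uint32_t addr,
                                         uint32_t val, int len)
    {
        pci_bridge_write_config(d, addr, val, len);

        /* PCI_SECONDARY_BUS (0x19) and PCI_SUBORDINATE_BUS (0x1a) */
        if (ranges_overlap(addr, len, PCI_SECONDARY_BUS, 2) &&
            !acpi_tables_sealed()) {            /* invented predicate */
            ivrs_update_alias_ranges(d);        /* invented helper */
        }
    }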

Another approach, and maybe what Michael was really suggesting, is that
we essentially create the ACPI tables twice, AFAICT: once initially,
then again via a select callback in fw_cfg.  For SeaBIOS, it looks like
this second generation would occur after the PCI bus has been
enumerated and initialized.  I've been trying to see whether the same
is likely for OVMF, though it's not clear to me that this is a
reasonable ordering to rely on.  It would be entirely reasonable for
firmware to process ACPI tables in advance of enumerating PCI, even
potentially as a prerequisite to enumerating PCI.  So ultimately I'm
not sure there are valid ordering assumptions that let us use these
callbacks this way, though I'd appreciate any further discussion.
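FWIW, the hook point itself already exists: hw/i386/acpi-build.c
installs acpi_build_update() as the fw_cfg select callback for the
table blob.  A rough paraphrase from memory (details may well be off)
of where an IVRS regeneration would slot in:

    /* Runs when firmware selects the ACPI table blob in fw_cfg.  If
     * firmware enumerates PCI first, bus numbers are final here and
     * the IVRS could be rebuilt with real BDFs.
     */
    static void acpi_build_update(void *build_opaque)
    {
        AcpiBuildState *build_state = build_opaque;
        AcpiBuildTables tables;

        if (!build_state || build_state->patched) {
            return;                     /* only rebuild once */
        }
        build_state->patched = 1;

        acpi_build_tables_init(&tables);
        acpi_build(&tables, MACHINE(qdev_get_machine()));
        /* ... copy the regenerated blobs over the fw_cfg data ... */
        acpi_build_tables_cleanup(&tables, true);
    }

The open question is whether that select reliably happens after PCI
enumeration on every firmware.  Thanks,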

Alex


