From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
Date: Mon, 8 Jun 2015 10:59:51 +0200

On Mon, Jun 08, 2015 at 09:09:15AM +0100, Malcolm Crossley wrote:
> On 08/06/15 08:42, Jan Beulich wrote:
> >>>> On 07.06.15 at 08:23, <address@hidden> wrote:
> >> On Mon, Apr 20, 2015 at 04:32:12PM +0200, Michael S. Tsirkin wrote:
> >>> On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
> >>>>>>> On 20.04.15 at 15:43, <address@hidden> wrote:
> >>>>> On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote:
> >>>>>>>>> On 13.04.15 at 14:47, <address@hidden> wrote:
> >>>>>>> Can you check device capabilities register, offset 0x4 within
> >>>>>>> pci express capability structure?
> >>>>>>> Bit 15 is Role-Based Error Reporting.
> >>>>>>> Is it set?
> >>>>>>>
> >>>>>>> The spec says:
> >>>>>>>
> >>>>>>> On platforms where robust error handling and PC-compatible
> >>>>>>> Configuration Space probing is required, it is suggested that
> >>>>>>> software or firmware have the Unsupported Request Reporting Enable
> >>>>>>> bit Set for Role-Based Error Reporting Functions, but clear for
> >>>>>>> 1.0a Functions. Software or firmware can distinguish the two
> >>>>>>> classes of Functions by examining the Role-Based Error Reporting
> >>>>>>> bit in the Device Capabilities register.
> >>>>>>
> >>>>>> Yes, that bit is set.
> >>>>>
> >>>>> curiouser and curiouser.
> >>>>>
> >>>>> So with functions that do support Role-Based Error Reporting, we have
> >>>>> this:
> >>>>>
> >>>>>
> >>>>> With device Functions implementing Role-Based Error Reporting,
> >>>>> setting the Unsupported Request Reporting Enable bit will not
> >>>>> interfere with PC-compatible Configuration Space probing, assuming
> >>>>> that the severity for UR is left at its default of non-fatal.
> >>>>> However, setting the Unsupported Request Reporting Enable bit will
> >>>>> enable the Function to report UR errors detected with posted
> >>>>> Requests, helping avoid this case for potential silent data
> >>>>> corruption.
> >>>>
> >>>> I still don't see what the PC-compatible config space probing has to
> >>>> do with our issue.
> >>>
> >>> I'm not sure but I think it's listed here because it causes a ton of URs
> >>> when device scan probes unimplemented functions.
> >>>
> >>>>> did firmware reconfigure this device to report URs as fatal errors then?
> >>>>
> >>>> No, the Unsupported Request Error Severity flag is zero.
> >>>
> >>> OK, that's the correct configuration, so how come the box crashes when
> >>> there's a UR then?
> >>
> >> Ping - any update on this?
> >
> > Not really. All we concluded so far is that _maybe_ the bridge, upon
> > seeing the UR, generates a Master Abort, rendering the whole thing
> > fatal. Otoh the respective root port also has
> > - Received Master Abort set in its Secondary Status register (but
> > that's also already the case in the log that we have before the UR
> > occurs, i.e. that doesn't mean all that much),
> > - Received System Error set in its Secondary Status register (and
> > after the UR the sibling endpoint [UR originating from 83:00.0,
> > sibling being 83:00.1] also shows Signaled System Error set).
> >
>
> Disabling the Memory decode in the command register could also result
> in a completion timeout on the root port issuing a transaction towards
> the PCI device in question.
Can it really? Such a device would violate the PCIe spec, which says:
    If the request is not claimed, then it is handled as an Unsupported
    Request, which is the PCI Express equivalent of conventional PCI’s
    Master Abort termination.
> PCIE completion timeouts can be escalated to Fatal AER errors which
> trigger system firmware to inject NMIs into the host.
>
> Unsupported requests can also be escalated to be Fatal AER errors
> (which would again trigger system firmware to inject an NMI).
Only if the system is misconfigured. We found out the system in question
is not configured to do this.
> Here is an example AER setup for a PCIE root port. You can see UnsupReq
> errors are masked and so do not trigger errors. CmpltTO (completion
> timeout) errors are not masked and the errors are treated as Fatal
> because the corresponding bit in the Uncorrectable Severity register
> is set.
>
> Capabilities: [148 v1] Advanced Error Reporting
>     UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>     UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol+
>     UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>     CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>     CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
>     AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>
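For readers who want to check the claim about this dump mechanically, here is a minimal sketch that turns the quoted UEMsk and UESvrt flag strings into register values and classifies individual errors. The bit positions are the standard ones for the AER Uncorrectable Error Status/Mask/Severity registers; the helper names `parse_flags` and `classify` are made up for illustration and are not part of any tool.

```python
# Bit positions in the AER Uncorrectable Error Status/Mask/Severity
# registers, keyed by the abbreviations lspci prints.
UNCOR_BITS = {
    "DLP": 4,
    "SDES": 5,
    "TLP": 12,
    "FCP": 13,
    "CmpltTO": 14,
    "CmpltAbrt": 15,
    "UnxCmplt": 16,
    "RxOF": 17,
    "MalfTLP": 18,
    "ECRC": 19,
    "UnsupReq": 20,
    "ACSViol": 21,
}


def parse_flags(text):
    """Hypothetical helper: turn 'DLP+ SDES- ...' into a register value."""
    value = 0
    for token in text.split():
        name, sign = token[:-1], token[-1]
        if sign == "+":
            value |= 1 << UNCOR_BITS[name]
    return value


def classify(name, mask, severity):
    """Hypothetical helper: report how one uncorrectable error is handled."""
    bit = 1 << UNCOR_BITS[name]
    if mask & bit:
        return "masked"
    return "fatal" if severity & bit else "non-fatal"


# Flag strings copied from the quoted lspci output.
UEMSK = parse_flags(
    "DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- "
    "MalfTLP- ECRC- UnsupReq+ ACSViol+"
)
UESVRT = parse_flags(
    "DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ "
    "MalfTLP+ ECRC- UnsupReq- ACSViol-"
)

print(classify("UnsupReq", UEMSK, UESVRT))  # masked
print(classify("CmpltTO", UEMSK, UESVRT))   # fatal
```

On this particular root port that gives "masked" for UnsupReq and "fatal" for CmpltTO, which matches the description above. It says nothing about how the endpoint itself is configured.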
> A root port completion timeout will also result in the master abort bit
> being set.
How do you figure this one out? The spec I have says master abort is the
equivalent of UR.
> Typically system firmware clears the error in the AER registers after
> it's processed it. So the operating system may not be able to determine
> what error triggered the NMI in the first place.
At least for debugging, just disable firmware and handle everything in
software.
> >> So can we chalk this up to hardware bugs on a specific box?
> >
> > I have to admit that I'm still very uncertain whether to consider all
> > this correct behavior, a firmware flaw, or a hardware bug.
> I believe the correct behaviour is happening but a PCIE completion
> timeout is occurring instead of an unsupported request.
>
> Malcolm
This guess would be easy to check - just mask out the timeout bit.
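To make "mask out the timeout bit" concrete, here is a minimal sketch of the arithmetic only. It assumes the standard AER layout, where the Uncorrectable Error Mask register sits at offset 0x08 within the AER capability and Completion Timeout is bit 14. The starting value 0x00318000 is what the UEMsk line in the quoted lspci dump decodes to; the function name is made up for illustration, and actually writing the value back to the device is left out.

```python
# Offset of the Uncorrectable Error Mask register inside the AER
# extended capability (standard layout; shown for reference only).
AER_UNCOR_MASK_OFFSET = 0x08

# Completion Timeout is bit 14 of the uncorrectable error registers.
CMPLT_TO = 1 << 14


def mask_completion_timeout(uemsk):
    """Hypothetical helper: return the mask value with CmpltTO masked."""
    return (uemsk | CMPLT_TO) & 0xFFFFFFFF


# UEMsk as decoded from the quoted dump:
# CmpltAbrt+ UnxCmplt+ UnsupReq+ ACSViol+ -> bits 15, 16, 20, 21.
old = 0x00318000
new = mask_completion_timeout(old)
print(f"UEMsk: {old:#010x} -> {new:#010x}")
```

If the box stops crashing with the new mask value written to the root port, that would point at a completion timeout rather than a UR.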
>
> >
> > Jan
> >