qemu-devel

From: Juan Quintela
Subject: Re: [PATCH v1 1/1] hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0
Date: Fri, 26 May 2023 09:55:22 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)

Jiri Denemark <jdenemar@redhat.com> wrote:
> On Thu, May 11, 2023 at 13:43:47 +0200, Juan Quintela wrote:
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> 
>> [Added libvirt people to the party, see the end of the message ]
>
> Sorry, I'm not that much into parties :-)
>
>> That would fix the:
>> 
>> qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
>> 
>> It is worth it?  Dunno.  That is my question.
>> 
>> And knowing from what qemu it has migrated from would not help.  We
>> would need to add a new tweak and means:
>> 
>> This is a pc-7.2 machine that has been instantiated in a qemu-8.0 and
>> has the pciaerr bug.  But wait, we have _that_.
>> 
>> And it is called
>> 
>> +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
>> 
>> from the patch.
>> 
>> We can teach libvirt about this glitch: if it is migrating a pc-7.2
>> machine that runs in qemu-8.0 and wants to migrate it to a new qemu
>> (call it qemu-8.1), the destination needs to be started as:
>> 
>> qemu-8.1 -M pc-7.2 <whatever pci devices need to do>,x-pcie-err-unc-mask="true"
>> 
>> Until the user reboots it and then that property can be reset to default
>> value.
>
> Hmm and what would happen if eventually this machine gets migrated back
> to qemu-8.0?

It works.
Migrating to qemu-7.2 is what is not going to work.
To migrate to qemu-8.0, you just need to drop the
"x-pcie-err-unc-mask=true" bit.  And it would work.

So, to be clear, this machine can migrate to:

- qemu-8.0, you just need to drop the "x-pcie-err-unc-mask=true" bit

- qemu-8.0.1 or newer, you just need to keep the
  "x-pcie-err-unc-mask=true" bit.

Let's just assume that qemu-7.2.1 doesn't get the
"x-pcie-err-unc-mask=true" bit, so it will not be able to migrate there.
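To make the list above concrete, here is a minimal Python sketch of the decision for a pc-7.2 guest that was started on qemu-8.0.  The helper names (version_tuple, extra_property_for) are made up for illustration, they are not part of qemu or libvirt, and the rules are only the ones stated in this message.

```python
from typing import Optional

# Property spelling taken from the patch line quoted earlier in the thread.
PROP = "x-pcie-err-unc-mask=true"


def version_tuple(version: str) -> tuple:
    """Turn "8.0" or "8.0.1" into a comparable tuple such as (8, 0, 0)."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def extra_property_for(target: str) -> Optional[str]:
    """Extra device property for the destination qemu (hypothetical helper).

    Applies to a pc-7.2 guest that was started on qemu-8.0.
    Returns "" when nothing must be added, and None when the migration is
    not expected to work.
    """
    v = version_tuple(target)
    if v < (8, 0, 0):
        return None   # qemu-7.2.x: assumed not to get the property
    if v == (8, 0, 0):
        return ""     # qemu-8.0: drop the property
    return PROP       # qemu-8.0.1 or newer: keep the property


if __name__ == "__main__":
    for target in ("7.2", "7.2.1", "8.0", "8.0.1", "8.1"):
        print(target, repr(extra_property_for(target)))
```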


> Or even when the machine is stopped, started again, and
> then migrated to qemu-8.0?

If you do what I call a hard reset (i.e. poweroff + poweron so qemu
dies), you should drop the "x-pcie-err-unc-mask=true" bit.  And then you
can migrate to qemu-7.2 and all qemu-8.0.1 and newer.

Basically what we need is a "mark" inside libvirt that means something
like:

- this is a weird machine that looks like pc-7.2
- but has "x-pcie-err-unc-mask=true"
- so it can only migrate to qemu-8.0 and newer.
- but if it ever reboots in qemu-8.0.1 or newer, we want it to go back
  to being a "normal" pc-7.2 machine (i.e. drop the
  x-pcie-err-unc-mask=true).

That would be the perfect world.  But as we are in an imperfect world,
something like:

- this machine started in qemu-8.0 -M pc-7.2, we know this is broken and
  it can't migrate outside of qemu-8.0 because it would fail to go to
  either qemu-7.2 or qemu-8.0.1.

I would argue that if you implement the second option, then doing the
"right" thing, i.e. the first option, is not much more complicated, but
that is a question that you are in a better position to answer.
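As a rough sketch of the two options, this is how such a "mark" could behave.  Everything here (Guest, cold_start, can_migrate_to, the perfect_world flag) is a hypothetical model written for this mail, not libvirt code, and versions are plain tuples to keep it short.

```python
from dataclasses import dataclass

BUGGY = (8, 0, 0)       # qemu-8.0, where pc-7.2 got the wrong behaviour
OLDEST = (7, 2, 0)      # oldest qemu that knows pc-7.2


@dataclass
class Guest:
    machine: str
    quirk: bool = False  # the "mark": guest carries the qemu-8.0 behaviour


def cold_start(machine: str, qemu: tuple) -> Guest:
    """A fresh qemu process (first boot or hard reset) recomputes the mark."""
    return Guest(machine, quirk=(machine == "pc-7.2" and qemu == BUGGY))


def can_migrate_to(guest: Guest, target: tuple, perfect_world: bool = True) -> bool:
    if not guest.quirk:
        # Normal pc-7.2: qemu-7.2 and qemu-8.0.1 or newer, as listed above.
        return target >= OLDEST and target != BUGGY
    if perfect_world:
        # First option: qemu-8.0 and newer (with the property where needed).
        return target >= BUGGY
    # Second option: the guest is stuck on qemu-8.0.
    return target == BUGGY


if __name__ == "__main__":
    broken = cold_start("pc-7.2", (8, 0, 0))
    print(can_migrate_to(broken, (8, 0, 1)))                       # True
    print(can_migrate_to(broken, (8, 0, 1), perfect_world=False))  # False
    fixed = cold_start("pc-7.2", (8, 0, 1))  # hard reset on a fixed qemu
    print(can_migrate_to(fixed, (7, 2, 0)))                        # True
```

The only difference between the two options is one branch, which is why I think the first one is not much more work once the mark exists.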

And then we have Michael's other question: how can we export that
information so libvirt can use it?

In this case we can communicate to libvirt:
- In qemu-8.0 we broke pc-7.2.
- The problem is fixed in qemu-8.0.1 using property
  "x-pcie-err-unc-mask=false".
- You can migrate from qemu-8.0 to newer versions if you set that
  property to true.
- Guests started in qemu-8.0 -M pc-7.2 should reboot in qemu-8.0.1 or
  newer to become "normal pc-7.2".
- If we publish this in qemu, we can only publish it in qemu-8.0.1 and
  newer.
- Or we can publish it somewhere else and any libvirt can take this
  information.
- Or we can communicate this to libvirt, and they incorporate it in their
  source wherever they see fit.

The point here is that when we use a property on a machine type, it can
be for two reasons:

- We detected at the right time that we changed the value of something,
  and we did the right thing on hw_compat_X_Y, so libvirt needs to do
  nothing.

- We *DID NOT* detect that we broke compatibility before release, and we
  need to make a property to identify that problem.  This is where we
  need to do this dance.
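A toy model of how such a compat property resolves, in Python rather than the real C tables.  The dictionaries and effective_props are invented for illustration; only the property name and its "off" value for old machine types come from the patch, and the "on" default for new machine types is my assumption here.

```python
# Default for the newest machine type (assumed for this sketch).
DEFAULTS = {"x-pcie-err-unc-mask": "on"}

# Per-machine-type overrides, loosely modelled on the hw_compat_X_Y idea.
COMPAT = {
    "pc-7.2": {"x-pcie-err-unc-mask": "off"},
}


def effective_props(machine, overrides=None):
    """Defaults, then machine-type compat, then explicit command-line values."""
    props = dict(DEFAULTS)
    props.update(COMPAT.get(machine, {}))
    props.update(overrides or {})
    return props


if __name__ == "__main__":
    # First case: compat entry was in place, nobody has to do anything.
    print(effective_props("pc-7.2"))
    # Second case: a guest that came from the broken release needs an
    # explicit override on the destination.
    print(effective_props("pc-7.2", {"x-pcie-err-unc-mask": "on"}))
```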

Notice that normally we detect lots of problems during development and
this *should* not happen.  But when it happens, we need to be able to do
something.

And also notice that normally we break just some device, not a whole
machine type.  But as you can see, we have broken it this time.  We are
trying to automate the detection of this kind of failure, but we are
still at the design stage, so we need to plan how to handle this.

Any comments?

Later, Juan.







