Re: [PATCH v4 2/3] i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE
From: William Roche
Subject: Re: [PATCH v4 2/3] i386: Explicitly ignore unsupported BUS_MCEERR_AO MCE on AMD guest
Date: Fri, 22 Sep 2023 18:18:39 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.15.0
On 9/22/23 16:30, Yazen Ghannam wrote:
On 9/22/23 4:36 AM, William Roche wrote:
On 9/21/23 19:41, Yazen Ghannam wrote:
[...]
Also, during page migration, does the data flow through the CPU core?
Sorry for the basic question. I haven't done a lot with virtualization.
Yes, in most cases (with the exception of RDMA) the data flows through
the CPU cores because the migration code checks whether the area to
transfer contains empty pages.
If the CPU moves the memory, then the data will pass through the core/L1
caches, correct? If so, then this will result in a MCE/poison
consumption/AR event in that core.
That's the entire point of this other patch I was referring to:
"Qemu crashes on VM migration after an handled memory error"
an example of a direct link:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg990803.html
The idea is to skip the pages we know are poisoned -- so we have a
chance to complete the migration without getting AR events :)
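As a rough illustration of the "skip the pages we know are poisoned" idea, here is a minimal sketch in C. It is not the code from the patch referenced above: the names poison_list, page_is_poisoned, migrate_range and count_page are hypothetical, and the real migration path tracks and transfers pages very differently. The only point is that a page on the known-poison list is never read, so the copy loop cannot consume the poison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of page frame numbers already known to be poisoned. */
struct poison_list {
    const uint64_t *pfns;
    size_t count;
};

/* Linear lookup; good enough for a sketch with a handful of entries. */
static bool page_is_poisoned(const struct poison_list *pl, uint64_t pfn)
{
    for (size_t i = 0; i < pl->count; i++) {
        if (pl->pfns[i] == pfn) {
            return true;
        }
    }
    return false;
}

/* Hypothetical callback that would read and transfer one page. */
typedef void (*send_page_fn)(uint64_t pfn, void *opaque);

/* Example callback: only counts the pages it is asked to send. */
static void count_page(uint64_t pfn, void *opaque)
{
    (void)pfn;
    (*(size_t *)opaque)++;
}

/*
 * Walk npages pages starting at first_pfn. Pages on the poison list are
 * skipped without being touched; every other page is handed to send().
 * Returns the number of pages that were skipped.
 */
static size_t migrate_range(uint64_t first_pfn, uint64_t npages,
                            const struct poison_list *pl,
                            send_page_fn send, void *opaque)
{
    size_t skipped = 0;

    for (uint64_t pfn = first_pfn; pfn < first_pfn + npages; pfn++) {
        if (page_is_poisoned(pl, pfn)) {
            skipped++;      /* never read this page: no AR event */
            continue;
        }
        send(pfn, opaque);
    }
    return skipped;
}
```

With a poison list holding pages 2 and 5, calling migrate_range(0, 8, &pl, count_page, &sent) skips two pages and passes the other six to the callback.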
So it seems to me that migration will always cause an AR event, and the
gap you describe will not occur. Does this make sense? Sorry if I
misunderstood.
In general, the hardware is designed to detect and mark poison, and to
not let poison escape a system undetected. In the strictest case, the
hardware will perform a system reset if poison is leaving the system. In
a more graceful case, the hardware will continue to pass the poison
marker with the data, so the destination hardware will receive it. In
both cases, the goal is to avoid silent data corruption, and to do so in
the hardware, i.e. without relying on firmware or software management.
The hardware designers are very keen on this point.
For the moment virtualization needs *several* enhancements just to deal
with memory errors -- what we are currently trying to fix is a good
example of that!
BTW, the RDMA case will need further discussion. I *think* this would
fall under the "strictest" case. And likely, CPU-based migration will
also. But I think we can test this and find out. :)
The test has been done, and showed that the RDMA migration is failing
when poison exists.
But we are discussing aspects that are probably too far from our main
topic here.
Please note that current AMD systems use an internal poison marker on
memory. This cannot be cleared through normal memory operations. The
only exception, I think, is to use the CLZERO instruction. This will
completely wipe a cacheline including metadata like poison, etc.
So the hardware should not (by design) lose track of poisoned data.
This would be better, but virtualization migration currently loses
track of this.
Which is not a problem for VMs where the kernel took note of the poison
and keeps track of it. Because this kernel will handle the poison
locations it knows about, signaling when these poisoned locations are
touched.
Can you please elaborate on this? I would expect the host kernel to do
all the physical, including poison, memory management.
Yes, the host kernel does that, and the VM kernel too for its own
address space.
Or do you mean in the nested poison case like this?
1) The host detects an "AO/deferred" error.
The host Kernel is notified by the hardware of an SRAO/deferred error
2) The host can try to recover the memory, if clean, etc.
From my understanding, this is an uncorrectable error, standard case
Kernel can't "clean" the error, but keeps track of it and tries to
signal the user of the impacted memory page every time it's needed.
3) Otherwise, the host passes the error info, with "AO/deferred" severity
to the guest.
Yes, in the case of a guest VM impacted, qemu asked to be informed of AO
events, so that the host kernel should signal it to qemu. Qemu then
relays the information (creating a virtual MCE event) that the VM Kernel
receives and deals with.
4) The guest, in nested fashion, can try to recover the memory, if
clean, etc. Or signal its own processes with the AO SIGBUS.
Here again there is no recovery: The VM kernel does the same thing as
the host kernel: memory management, possible signals, etc...
An enhancement will be to take the MCA error information collected
during the interrupt and extract useful data. For example, we'll need to
translate the reported address to a system physical address that can be
mapped to a page.
This would be great, as it would mean that a kernel running in a VM can
get notified too.
Yes, I agree.
Once we have the page, then we can decide how we want to signal the
process(es). We could get a deferred/AO error in the host, and signal the
guest with an AR. So the guest handling could be the same in both cases.
Would this be okay? Or is it important that the guest can distinguish
between the AO/AR cases?
SIGBUS/BUS_MCEERR_AO and BUS_MCEERR_AR are not interchangeable; it is
important to distinguish them.
AO is an asynchronous signal that is only generated when the process
asked for it -- indicating that an error has been detected in its
address space but hasn't been touched yet.
Most processes don't care about that (and don't get notified);
they just continue to run. If the poisoned area is never touched, great.
Otherwise a BUS_MCEERR_AR signal is generated when the area is touched,
indicating that the execution thread can't access the location.
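To make the AO/AR distinction concrete from the point of view of a user process, here is a minimal sketch in C of a SIGBUS handler that looks at si_code. The helper names (classify_sigbus, sigbus_handler, install_sigbus_handler) and the enum are made up for illustration; BUS_MCEERR_AO and BUS_MCEERR_AR are the Linux si_code values, and the fallback defines (4 and 5) are only there in case the libc headers don't expose them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>

/* Fallbacks with the Linux values, in case the headers lack them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical classification used only by this sketch. */
enum mce_kind { MCE_NONE, MCE_ACTION_OPTIONAL, MCE_ACTION_REQUIRED };

/* Map a SIGBUS si_code to the kind of memory error it reports. */
static enum mce_kind classify_sigbus(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AO:
        /* Poison detected in our address space, not yet consumed. */
        return MCE_ACTION_OPTIONAL;
    case BUS_MCEERR_AR:
        /* The faulting thread tried to consume the poisoned data. */
        return MCE_ACTION_REQUIRED;
    default:
        return MCE_NONE;    /* some other SIGBUS cause */
    }
}

static volatile sig_atomic_t last_mce_kind = MCE_NONE;
static void *volatile last_mce_addr;

/*
 * Record what happened and where. A real handler must not simply return
 * on an AR event: the faulting access would be retried and fault again.
 */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    last_mce_kind = classify_sigbus(info->si_code);
    last_mce_addr = info->si_addr;
}

/* Install the handler with SA_SIGINFO so that si_code is available. */
static int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

A process that receives the AO flavor can, for instance, drop or rebuild the affected data at its own pace; with the AR flavor the current access cannot complete and has to be abandoned.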
Yes, understood.
IOW, will guests have their own policies on
when to take action? Or is it more about allowing the guest to handle
the error less urgently?
Yes to both questions. Any process can indicate whether it wants to
be "early killed on MCE" or not. See the proc(5) man page about
/proc/sys/vm/memory_failure_early_kill, and prctl(2) about
PR_MCE_KILL/PR_MCE_KILL_GET. Such a process could take action before
it is too late, i.e. before it actually needs the poisoned data.
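For reference, opting in to the early notification from a process could look like the following minimal sketch in C. The two wrapper names (request_early_kill, current_mce_kill_policy) are made up for this example; the prctl(2) options and constants are the ones documented in the man page mentioned above.

```c
#include <sys/prctl.h>

/*
 * Ask the kernel to apply the "early kill" policy to the calling thread:
 * a SIGBUS with si_code BUS_MCEERR_AO is sent as soon as a memory error
 * is detected in its address space, before the data is consumed.
 * Returns 0 on success, -1 on error (errno is set by prctl).
 */
static int request_early_kill(void)
{
    /* arg4 and arg5 are unused and must be zero. */
    return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/*
 * Read back the current policy: PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE or
 * PR_MCE_KILL_DEFAULT (follow the system-wide setting), or -1 on error.
 */
static int current_mce_kill_policy(void)
{
    return prctl(PR_MCE_KILL_GET, 0, 0, 0, 0);
}
```

A process using this would normally also install a SIGBUS handler, otherwise the early signal just terminates it.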
Yes, agree. I think the "nested" case above would fall under this. Also,
an application, or software stack, with complex memory management could
benefit.
Sure -- for example, some databases already take advantage of this
mechanism ;)
In other words, having the AMD kernel generate SIGBUS/BUS_MCEERR_AO
signals and making AMD qemu able to relay them to the VM kernel would
make things better for AMD platforms ;)
Yes, I agree. :)
So in my opinion, for the moment we should integrate the 3 proposed
patches, and continue to work to make:
- the AMD kernel deal better with SRAO both on the host
and the VM sides,
- in relationship with another qemu enhancement to relay the
BUS_MCEERR_AO signal so that the VM kernel deals with it too.
The reason why I started this conversation was to know if there would be
a simple way to already inform the VM kernel of an AO signal (without
crashing it) even if it is not yet able to relay the event to its own
processes. But this would prepare qemu so that when the kernel is
enhanced, it may not be necessary to modify qemu again.
The patches we are currently focusing on (Fix MCE handling on AMD hosts)
help to better deal with BUS_MCEERR_AR signal instead of crashing --
this looks like a necessary step to me.
HTH,
William.