qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH v2 0/6] cxl: add poison event handler


From: Dan Williams
Subject: Re: [RFC PATCH v2 0/6] cxl: add poison event handler
Date: Fri, 29 Mar 2024 13:56:16 -0700

Alison Schofield wrote:
> On Fri, Mar 29, 2024 at 11:22:32AM -0700, Dan Williams wrote:
> > Alison Schofield wrote:
> > [..]
> > > Upon receipt of that new poison list, call memory_failture_queue()
> > > on *any* poison in a mapped space. Is that OK?  Can we call
> > > memory_failure_queue() on any and every poison report that is in
> > > HPA space regardless of whether it first came to us through a GMER?
> > > I'm actually wondering if that is going to be the next ask anyway -
> > > ie report all poison.
> > 
> > memory_failure_queue() should be called on poison creation events. Leave
> > the MF_ACTION_REQUIRED flag not set so that memory_failure() performs
> > "action optional" handling.  So I would expect memory_failure_queue()
> > notification for GMER events, but not on poison list events.
> 
> Seems I totally missed the point of this patch set.
> Is it's only purpose to make sure that poison that is injected gets
> reported to memory_failure?

Clarify terms, "poison injection" to me is a debug-only event to test
that poison handling is working, "poison creation" is an event where new
poison was encountered by CPU consumption, deposited by a
DMA-with-poison transaction, or discovered by a background scrub
operation.

> 
> So this single patch only:
> 1. Poison inject leads to this GMER/CXL_EVENT_TRANSACTION_INJECT_POISON 

Inject is a special case. Likely this should copy the PMEM legacy where
notifying memory_failure() on injected poison is optional:

"ndctl inject-error --no-notify"

> 2. Driver sees GMER/CXL_EVENT_TRANSACTION_INJECT_POISON and reads poison
> list to get accurate length.

Again, inject is the least interesting for the common case, production
kernels care about "Media ECC Error" and "Scrub Media ECC Error"
regardless of transaction type.

> 3. Driver reports that to memory_failure_queue()
> 
> Still expect there's some code sharing opportunities and I still wonder
> about what is next in this area.

One area this needs to be careful is in unifying the OS-first and
FW-first paths. In the FW-first case the platform can trigger
memory_failure() along with the GMER by just posting a memory failure
CPER record. So there is a risk that things get doubly reported if the
GMER handling code blindly triggers memory_failure(). Might be benign,
but probably best avoided.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]