[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?
From: |
David Gibson |
Subject: |
Re: spapr_events: Sure we may ignore migrate_add_blocker() failure? |
Date: |
Wed, 21 Jul 2021 16:26:29 +1000 |
On Mon, Jul 19, 2021 at 12:41:09PM +0200, Markus Armbruster wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
>
> > On Mon, Jul 19, 2021 at 09:18:07AM +0200, Markus Armbruster wrote:
> >> David Gibson <david@gibson.dropbear.id.au> writes:
> >>
> >> > On Thu, Jul 15, 2021 at 03:32:06PM +0200, Markus Armbruster wrote:
> >> >> Commit 2500fb423a "migration: Include migration support for machine
> >> >> check handling" adds this:
> >> >>
> >> >> ret = migrate_add_blocker(spapr->fwnmi_migration_blocker,
> >> >> &local_err);
> >> >> if (ret == -EBUSY) {
> >> >> /*
> >> >> * We don't want to abort so we let the migration to continue.
> >> >> * In a rare case, the machine check handler will run on the
> >> >> target.
> >> >> * Though this is not preferable, it is better than aborting
> >> >> * the migration or killing the VM.
> >> >> */
> >> >> warn_report("Received a fwnmi while migration was in progress");
> >> >> }
> >> >>
> >> >> migrate_add_blocker() can fail in two ways:
> >> >>
> >> >> 1. -EBUSY: migration is already in progress
> >> >>
> >> >> Ignoring this one is clearly intentional. The comment explains why.
> >> >> I'm taking it at face value (I'm a spapr ignoramus).
> >> >
> >> > Right. The argument isn't really about papr particularly, except
> >> > insofar as understanding what fwnmi is. fwnmi (FirmWare assisted NMI)
> >> > is a reporting mechanism for certain low-level hardware failures
> >> > (think memory ECC or cpu level faults, IIRC). If we migrate between
> >> > detecting and reporting the error, then the particulars we report will
> >> > be mostly meaningless since they relate to hardware we're no longer
> >> > running on. Hence the migration blocker.
> >> >
> >> > However, migrating away from a (non-fatal) fwnmi error is a pretty
> >> > reasonable response, so we don't want to actually fail a migration if
> >> > its already in progress.
> >> >
> >> >> Aside: I doubt
> >> >> the warning is going to help users.
> >> >
> >> > You're probably right, but it's not very clear how to do better. It
> >> > might possibly help someone in tech support explain why the reported
> >> > fwnmi doesn't seem to match the hardware the guest is (now) running
> >> > on.
> >>
> >> Perhaps pointing to the actual problem could help: the FWNMI's
> >> information is mostly meaningless.
> >
> > Sorry, I don't follow what you're suggesting.
>
> We warn
>
> warning: Received a fwnmi while migration was in progress
>
> when we fail to block migration because it's already in progress.
> But what does this mean? Perhaps warn like this:
>
> warning: FWNMI while migration is in progress
> The guest's report for this may be less than useful.
>
> My phrasing may well be off, but I hope you get the idea.
I see your point. It may be some time before this reaches the top of
priority list, however.
> Note that we keep quiet when we fail to block migration due to
> -only-migrate. I agree with that. The failure makes a difference only
> when migration gets triggered in a narrow time window, which should be
> quite rare. Would be nice to warn when migration does get triggered in
> that time window, though. Not sure it's worth the trouble, in
> particular if we'd have to create infrastructure first.
>
> >
> >>
> >> >> 2. -EACCES: we're running with -only-migratable
> >> >>
> >> >> Why may we ignore -only-migratable here?
> >> >
> >> > Short answer: because I didn't think about that case. Long answer:
> >> > I think we probably shoud ignore it anyway. As above, receiving a
> >> > fwnmi doesn't really prevent migration, it just means that if you're
> >> > unlucky it can report stale information. Since migrating away from a
> >> > possibly-dubious host would be a reasonable response to a non-fatal
> >> > fwnmi, I don't think we want to simply prohibit fwnmi entirely with
> >> > -only-migratable.
> >>
> >> I think the comment text and placement could be improved to make clear
> >> ignoring this failure is intentional, too. How do you like the
> >> following?
> >
> > That's fair..
> >
> >>
> >> diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
> >> index a8f2cc6bdc..54d8e856d3 100644
> >> --- a/hw/ppc/spapr_events.c
> >> +++ b/hw/ppc/spapr_events.c
> >> @@ -911,16 +911,14 @@ void spapr_mce_req_event(PowerPCCPU *cpu, bool
> >> recovered)
> >> }
> >> }
> >>
> >> + /*
> >> + * Try to block migration while FWNMI is being handled, so the
> >> + * machine check handler runs where the information passed to it
> >> + * actually makes sense. This won't actually block migration,
> >> + * only delay it slightly. If the attempt fails, carry on.
> >> + */
> >> ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, NULL);
> >> if (ret == -EBUSY) {
> >> - /*
> >> - * We don't want to abort so we let the migration to continue.
> >> - * In a rare case, the machine check handler will run on the
> >> target.
> >> - * Though this is not preferable, it is better than aborting
> >> - * the migration or killing the VM. It is okay to call
> >> - * migrate_del_blocker on a blocker that was not added (which the
> >> - * nmi-interlock handler would do when it's called after this).
> >> - */
> >> warn_report("Received a fwnmi while migration was in progress");
> >> }
> >
> > LGTM.
>
> Thanks, I'll post this.
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
signature.asc
Description: PGP signature
- Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?, (continued)
- Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?, David Gibson, 2021/07/18
- Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?, Markus Armbruster, 2021/07/19
- Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?, David Gibson, 2021/07/19
- Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?, Markus Armbruster, 2021/07/19
- -only-migrate and the two different uses of migration blockers (was: spapr_events: Sure we may ignore migrate_add_blocker() failure?), Markus Armbruster, 2021/07/19
- Re: -only-migrate and the two different uses of migration blockers (was: spapr_events: Sure we may ignore migrate_add_blocker() failure?), Dr. David Alan Gilbert, 2021/07/19
- Re: -only-migrate and the two different uses of migration blockers, Markus Armbruster, 2021/07/20
- Re: -only-migrate and the two different uses of migration blockers, David Gibson, 2021/07/21
- Re: -only-migrate and the two different uses of migration blockers, Dr. David Alan Gilbert, 2021/07/22
- Re: -only-migrate and the two different uses of migration blockers, David Gibson, 2021/07/25
- Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?,
David Gibson <=