qemu-devel

Re: [PATCH 1/5] migration: Fix possible deadloop of ram save process


From: Peter Xu
Subject: Re: [PATCH 1/5] migration: Fix possible deadloop of ram save process
Date: Thu, 22 Sep 2022 11:25:38 -0400

On Thu, Sep 22, 2022 at 03:49:38PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > When starting ram saving procedure (especially at the completion phase),
> > always set last_seen_block to non-NULL to make sure we can always correctly
> > detect the case where "we've migrated all the dirty pages".
> > 
> > Then we'll guarantee both last_seen_block and pss.block will be valid
> > always before the loop starts.
> > 
> > See the comment in the code for some details.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Yeh I guess it can currently only happen during restart?

There are only two places that clear last_seen_block:

ram_state_reset[2683]          rs->last_seen_block = NULL;
ram_postcopy_send_discard_bitmap[2876] rs->last_seen_block = NULL;

And for the reset case, the callers are:

ram_state_init[2994]           ram_state_reset(*rsp);
ram_state_resume_prepare[3110] ram_state_reset(rs);
ram_save_iterate[3271]         ram_state_reset(rs);

So I think it can happen in at least two cases: either (1) postcopy has just
started (assuming postcopy happens to start right when all dirty pages have
been migrated?), or (2) postcopy is recovering from a failure.
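
To illustrate why a NULL last_seen_block matters when nothing is dirty, here
is a toy model of the scan. It is only a sketch under assumptions, not the
QEMU code: struct block, scan_for_dirty() and the max_steps guard are
hypothetical names made up for this example. The idea is that the "we have
gone a full round" check compares the cursor against last_seen, and the
cursor is never NULL once the scan has started.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a RAM block: a node in a NULL-terminated list. */
struct block {
    struct block *next;
    bool dirty;
};

/*
 * Toy scan (not QEMU's implementation). Start at last_seen, or at the
 * first block if last_seen is NULL, and look for a dirty block.
 *
 * Returns the number of steps taken before a dirty block was found or a
 * full round was detected, or -1 if max_steps was exhausted. The -1 case
 * stands in for the deadloop; max_steps exists only so the model ends.
 */
static int scan_for_dirty(struct block *first, struct block *last_seen,
                          int max_steps)
{
    assert(first != NULL);

    struct block *cur = last_seen ? last_seen : first;
    bool complete_round = false;

    for (int steps = 0; steps < max_steps; steps++) {
        /* Full round: we wrapped and came back to where we started. */
        if (complete_round && cur == last_seen) {
            return steps;
        }
        if (cur->dirty) {
            return steps;
        }
        cur = cur->next;
        if (!cur) {
            /* End of list: wrap around to the first block. */
            cur = first;
            complete_round = true;
        }
    }
    /* cur never matched last_seen: with last_seen == NULL it never can. */
    return -1;
}
```

With three clean blocks, calling scan_for_dirty() with last_seen == NULL runs
out of steps (returns -1), while passing the first block as last_seen lets
the full-round check fire after one pass over the list.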

In my case I triggered this deadloop while debugging the other bug fixed by
the next patch, which was during postcopy recovery (on tls), but only once.
So I'm still not 100% sure whether this is the same problem, but logically
it could trigger.

I also remember hitting very rare deadloops before; maybe they're the same
thing, because I did test recovery a lot.

> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks!

-- 
Peter Xu
