qemu-devel

Re: QEMU migration-test CI intermittent failure


From: Fabiano Rosas
Subject: Re: QEMU migration-test CI intermittent failure
Date: Thu, 14 Sep 2023 19:54:17 -0300

Fabiano Rosas <farosas@suse.de> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Thu, Sep 14, 2023 at 12:57:08PM -0300, Fabiano Rosas wrote:
>>> I managed to reproduce it. It's not the return path error. In hindsight
>>> that's obvious because that error happens in the 'recovery' test and this
>>> one in the 'plain' one. Sorry about the noise.
>>
>> No worry.  It's good to finally identify that.
>>
>>> 
>>> This one reproduced with just 4 iterations of preempt/plain. I'll
>>> investigate.
>
> It seems that we're getting a TCP disconnect (ECONNRESET) when doing
> that shutdown() on postcopy_qemufile_src, the one from commit 6621883f93
> ("migration: Fix potential race on postcopy_qemufile_src").
>
> I'm trying to determine why that happens when other times it just
> returns 0 as expected.
>
> Could this mean that we're kicking the dest too soon while it is still
> receiving valid data?

Looking a bit more into this, what's happening is that
postcopy_ram_incoming_cleanup() is shutting down postcopy_qemufile_dst
while ram_load_postcopy() is still running.

The postcopy_ram_listen_thread() function waits for the
main_thread_load_event, but that only works when not using preempt. With
the preempt thread, the event is set right away and we proceed to do the
cleanup without waiting.
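
To make the ordering concrete, here is a toy Python model of it. This is
not QEMU code: run_incoming(), may_finish and the log strings are
hypothetical, and the non-preempt branch assumes for simplicity that the
event is only set once loading is done. It just shows that an event set
"right away" no longer orders the cleanup after the page loading.

```python
import threading


def run_incoming(preempt):
    """Toy model of the incoming side; returns the order of the two steps."""
    log = []
    main_thread_load_event = threading.Event()  # name borrowed from the mail
    may_finish = threading.Event()  # hypothetical: lets the loader complete

    def loader():
        # Stands in for ram_load_postcopy() still consuming pages.
        may_finish.wait()
        log.append("load finished")
        if not preempt:
            # Assumption of this model: without preempt, the event is only
            # set once loading is done.
            main_thread_load_event.set()

    thread = threading.Thread(target=loader)
    thread.start()

    if preempt:
        # With preempt, the event is set right away.
        main_thread_load_event.set()
    else:
        may_finish.set()

    # Stands in for postcopy_ram_listen_thread() waiting, then cleaning up.
    main_thread_load_event.wait()
    log.append("cleanup")

    may_finish.set()  # in the preempt case the loader only finishes now
    thread.join()
    return log


if __name__ == "__main__":
    print("no preempt:", run_incoming(preempt=False))
    print("preempt:   ", run_incoming(preempt=True))
```

In the model, the non-preempt run logs the load before the cleanup, while
the preempt run logs the cleanup first, which is the window in which the
shutdown can hit a read that is still in flight.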

So the assumption of commit 6621883f93 that the incoming side knows when
it has finished migrating is wrong IMO. Without the EOS we're relying on
the chance that the shutdown() happens after the last recvmsg has
returned and not during it.
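
As an illustration of what an explicit end-of-stream marker buys, here is
a minimal Python sketch over a socketpair. The framing (a 4-byte length
prefix, with a zero-length frame as EOS) and all function names are
assumptions made up for this example and are not QEMU's wire format. The
point is only that the receiver shuts the channel down after it has seen
the marker, so no read can be in flight at that moment.

```python
import socket
import struct

EOS_LENGTH = 0  # hypothetical marker: a zero-length frame ends the stream


def send_frame(sock, payload):
    """Send one length-prefixed frame."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, or fail if the peer goes away first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before end-of-stream")
        buf += chunk
    return bytes(buf)


def receive_until_eos(sock):
    """Collect frames until the explicit EOS frame arrives."""
    pages = []
    while True:
        (length,) = struct.unpack("!I", recv_exact(sock, 4))
        if length == EOS_LENGTH:
            return pages
        pages.append(recv_exact(sock, length))


def demo():
    src, dst = socket.socketpair()
    with src, dst:
        for page in (b"page-0", b"page-1", b"page-2"):
            send_frame(src, page)
        send_frame(src, b"")  # explicit EOS
        pages = receive_until_eos(dst)
        # Only now is it known that nothing is left to read.
        dst.shutdown(socket.SHUT_RDWR)
    return pages


if __name__ == "__main__":
    print(demo())
```

If the sender goes away without sending the marker, receive_until_eos()
raises instead of returning a short list, so a truncated stream is not
mistaken for a finished one.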

Peter, what do you think?


