qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: QEMU migration-test CI intermittent failure


From: Fabiano Rosas
Subject: Re: QEMU migration-test CI intermittent failure
Date: Thu, 14 Sep 2023 12:57:08 -0300

Peter Xu <peterx@redhat.com> writes:

> On Thu, Sep 14, 2023 at 12:10:04PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Wed, Sep 13, 2023 at 04:42:31PM -0300, Fabiano Rosas wrote:
>> >> Stefan Hajnoczi <stefanha@redhat.com> writes:
>> >> 
>> >> > Hi,
>> >> > The following intermittent failure occurred in the CI and I have filed
>> >> > an Issue for it:
>> >> > https://gitlab.com/qemu-project/qemu/-/issues/1886
>> >> >
>> >> > Output:
>> >> >
>> >> >   >>> QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=116 
>> >> > QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon 
>> >> > G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh
>> >> >  QTEST_QEMU_BINARY=./qemu-system-x86_64 
>> >> > /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> >> >   ――――――――――――――――――――――――――――――――――――― ✀  
>> >> > ―――――――――――――――――――――――――――――――――――――
>> >> >   stderr:
>> >> >   qemu-system-x86_64: Unable to read from socket: Connection reset by 
>> >> > peer
>> >> >   Memory content inconsistency at 5b43000 first_byte = bd last_byte = 
>> >> > bc current = 4f hit_edge = 1
>> >> >   **
>> >> >   ERROR:../tests/qtest/migration-test.c:300:check_guests_ram: assertion 
>> >> > failed: (bad == 0)
>> >> >   (test program exited with status code -6)
>> >> >
>> >> > You can find the full output here:
>> >> > https://gitlab.com/qemu-project/qemu/-/jobs/5080200417
>> >> 
>> >> This is the postcopy return path issue that I'm addressing here:
>> >> 
>> >> 20230911171320.24372-1-farosas@suse.de">https://lore.kernel.org/r/20230911171320.24372-1-farosas@suse.de
>> >> Subject: [PATCH v6 00/10] Fix segfault on migration return path
>> >> Message-ID: <20230911171320.24372-1-farosas@suse.de>
>> >
>> > Hmm I just noticed one thing, that Stefan's failure is a ram check issue
>> > only, which means qemu won't crash?
>> >
>> 
>> The source could have crashed and left the migration at an inconsistent
>> state and then the destination saw corrupted memory?
>> 
>> > Fabiano, are you sure it's the same issue on your return-path fix?
>> >
>> 
>> I've been running the preempt tests on my branch for thousands of
>> iterations and didn't see any other errors. Since there's no code going
>> into the migration tree recently I assume it's the same error.
>> 
>> I run the tests with GDB attached to QEMU, so I'll always see a crash
>> before any memory corruption.
>
> Okay, maybe that stops you from seeing the above check_guests_ram() error?
> Worth checking whether it fails differently always if you just don't attach
> gdb to it; I had a feeling that it'll always fail in the other way (I think
> migration-test will say something like "qemu killed" etc. in most cases),
> further to identify the issues.
>
>> 
>> > I'm also trying to reproduce either of them with some loads.  I think I hit
>> > some but it's very hard to reproduce solidly.
>> 
>> Well, if you find anything else let me know and we'll fix it.
>
> I think Stefan's issue is the one I triggered once, but only once; I did
> see check_guests_ram() lines.
>
> I ran concurrently 10 migration-tests (on 8 cores; just to make scheduler
> start to really work), each looping over preempt/plain for 500 times and
> hit nothing..  I'm trying again with a larger host with more instances, so
> far I've run 200 loops over 40 instances running together, I hit
> nothing.. but I'm keeping trying.

I managed to reproduce it. It's not the return path error. In hindsight
that's obvious because that error happens in the 'recovery' test and this
one in the 'plain' one. Sorry about the noise.

This one reproduced with just 4 iterations of preempt/plain. I'll
investigate.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]