qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: QEMU migration-test CI intermittent failure


From: Peter Xu
Subject: Re: QEMU migration-test CI intermittent failure
Date: Thu, 14 Sep 2023 11:35:47 -0400

On Thu, Sep 14, 2023 at 12:10:04PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Wed, Sep 13, 2023 at 04:42:31PM -0300, Fabiano Rosas wrote:
> >> Stefan Hajnoczi <stefanha@redhat.com> writes:
> >> 
> >> > Hi,
> >> > The following intermittent failure occurred in the CI and I have filed
> >> > an Issue for it:
> >> > https://gitlab.com/qemu-project/qemu/-/issues/1886
> >> >
> >> > Output:
> >> >
> >> >   >>> QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=116 
> >> > QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon 
> >> > G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh
> >> >  QTEST_QEMU_BINARY=./qemu-system-x86_64 
> >> > /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> >> >   ――――――――――――――――――――――――――――――――――――― ✀  
> >> > ―――――――――――――――――――――――――――――――――――――
> >> >   stderr:
> >> >   qemu-system-x86_64: Unable to read from socket: Connection reset by 
> >> > peer
> >> >   Memory content inconsistency at 5b43000 first_byte = bd last_byte = bc 
> >> > current = 4f hit_edge = 1
> >> >   **
> >> >   ERROR:../tests/qtest/migration-test.c:300:check_guests_ram: assertion 
> >> > failed: (bad == 0)
> >> >   (test program exited with status code -6)
> >> >
> >> > You can find the full output here:
> >> > https://gitlab.com/qemu-project/qemu/-/jobs/5080200417
> >> 
> >> This is the postcopy return path issue that I'm addressing here:
> >> 
> >> 20230911171320.24372-1-farosas@suse.de">https://lore.kernel.org/r/20230911171320.24372-1-farosas@suse.de
> >> Subject: [PATCH v6 00/10] Fix segfault on migration return path
> >> Message-ID: <20230911171320.24372-1-farosas@suse.de>
> >
> > Hmm I just noticed one thing, that Stefan's failure is a ram check issue
> > only, which means qemu won't crash?
> >
> 
> The source could have crashed and left the migration at an inconsistent
> state and then the destination saw corrupted memory?
> 
> > Fabiano, are you sure it's the same issue on your return-path fix?
> >
> 
> I've been running the preempt tests on my branch for thousands of
> iterations and didn't see any other errors. Since there's no code going
> into the migration tree recently I assume it's the same error.
> 
> I run the tests with GDB attached to QEMU, so I'll always see a crash
> before any memory corruption.

Okay, maybe that stops you from seeing the above check_guests_ram() error?
Worth checking whether it fails differently always if you just don't attach
gdb to it; I had a feeling that it'll always fail in the other way (I think
migration-test will say something like "qemu killed" etc. in most cases),
further to identify the issues.

> 
> > I'm also trying to reproduce either of them with some loads.  I think I hit
> > some but it's very hard to reproduce solidly.
> 
> Well, if you find anything else let me know and we'll fix it.

I think Stefan's issue is the one I triggered once, but only once; I did
see check_guests_ram() lines.

I ran concurrently 10 migration-tests (on 8 cores; just to make scheduler
start to really work), each looping over preempt/plain for 500 times and
hit nothing..  I'm trying again with a larger host with more instances, so
far I've run 200 loops over 40 instances running together, I hit
nothing.. but I'm keeping trying.

-- 
Peter Xu




reply via email to

[Prev in Thread] Current Thread [Next in Thread]