qemu-s390x

Re: [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64


From: Dr. David Alan Gilbert
Subject: Re: [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64
Date: Tue, 5 Jul 2022 09:38:46 +0100
User-agent: Mutt/2.2.6 (2022-06-05)

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Jul 05, 2022 at 10:06:58AM +0200, Thomas Huth wrote:
> > On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> > > There have been checks put into the migration test which skip it in a
> > > few scenarios
> > > 
> > >   * ppc64 TCG
> > >   * ppc64 KVM with kvm-pr
> > >   * s390x TCG
> > > 
> > > In the original commits there are references to unexplained hangs in
> > > the test. There is no record of details of where it was hanging, but
> > > it is suspected that these were all a result of the max downtime limit
> > > being set at too low a value to guarantee convergence.
> > > 
> > > Since a previous commit bumped the value from 1 second to 30 seconds,
> > > it is believed that hangs due to non-convergence should be eliminated
> > > and thus worth trying to remove the skipped scenarios.
> > > 
> > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > ---
> > >   tests/qtest/migration-test.c | 21 ---------------------
> > >   1 file changed, 21 deletions(-)
> > 
> > I just gave this a try, and it's failing on my x86 laptop with the ppc64 
> > target:
> > 
> > /ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't
> > support requested feature, cap-cfpc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-sbbc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ibs=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ccf-assist=on
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-cfpc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-sbbc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ibs=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ccf-assist=on
> > Memory content inconsistency at df6000 first_byte = 98 last_byte = 98
> > current = 2 hit_edge = 0

98->2 is a strangely large gap, and just one page.

> > Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1

Yeh that's broken; the way I think about this is you've got a loop
and the guest is following the loop incrementing one page at a time;
if you stop the world you should see one 'edge' where the incrementer
has incremented the previous page but hasn't done the current
page yet. e.g. in this case the 'start' of the memory is 98, and we
were seeing 97, so we've run past that 'edge' at some point earlier.
Now we've hit 96, which should be impossible, because all of the 96's
should have been incremented away before there was ever a 98 in the loop.
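
To make the 'edge' rule concrete, here is a minimal sketch of the kind of
check being described. It is not the code from migration-test.c; the
function name count_inconsistent_pages and the idea of passing in an array
holding one sampled byte per page are assumptions made purely for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from QEMU): pages[i] is the counter byte read
 * from page i while the guest is stopped. The guest increments one byte
 * per page, in order, wrapping mod 256. A consistent snapshot therefore
 * holds one value up to the incrementer's position and (value - 1) after
 * it, i.e. at most one downward step of exactly 1.
 */
static size_t count_inconsistent_pages(const uint8_t *pages, size_t n)
{
    assert(pages != NULL && n > 0);

    uint8_t last_byte = pages[0];
    bool hit_edge = false;
    size_t bad = 0;

    for (size_t i = 1; i < n; i++) {
        uint8_t b = pages[i];

        if (b == last_byte) {
            continue;               /* same generation as the previous page */
        }
        if ((uint8_t)(b + 1) == last_byte && !hit_edge) {
            /* The one permitted edge: the incrementer has not got here yet. */
            hit_edge = true;
            last_byte = b;
        } else {
            /* A second step down, or a jump of more than one. */
            bad++;
        }
    }
    return bad;
}
```

With the values from the log, a run of 98s followed by 97s passes, while a
96 seen after the 98->97 edge is counted as bad, as is a lone 2 in the
middle of the 98s.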

> > Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > and in another 5542 pages**
> > ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram:
> > assertion failed: (bad == 0)
> > Aborted (core dumped)
> > 
> > So I guess this workaround was about a different issue and we should drop
> > this patch.
> 
> Yeah, at the very least it needs further investigation.
> 
> It is a little worrying though that we get such failures as it smells
> like a genuine bug that we've been missing from having tests disabled.

Yeh I suspect it's a TCG bug not updating the 'changed' flag on the page
*after* writing the data.  I believe we've seen a case on ARM.

Dave

> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



