qemu-devel

Re: [PATCH v3 9/9] tests/qtest: massively speed up migration-test


From: Peter Xu
Subject: Re: [PATCH v3 9/9] tests/qtest: massively speed up migration-test
Date: Thu, 1 Jun 2023 12:22:36 -0400

On Thu, Jun 01, 2023 at 05:05:23PM +0100, Daniel P. Berrangé wrote:
> On Thu, Jun 01, 2023 at 11:46:01AM -0400, Peter Xu wrote:
> > On Wed, May 31, 2023 at 02:24:00PM +0100, Daniel P. Berrangé wrote:
> > > The migration test cases that actually exercise live migration want to
> > > ensure there is a minimum of two iterations of pre-copy, in order to
> > > exercise the dirty tracking code.
> > > 
> > > Historically we've queried the migration status, looking for the
> > > 'dirty-sync-count' value to increment to track iterations. This was
> > > not entirely reliable because often all the data would get transferred
> > > quickly enough that the migration would finish before we wanted it
> > > to. So we massively dropped the bandwidth and max downtime to
> > > guarantee non-convergence. This had the unfortunate side effect
> > > that every migration took at least 30 seconds to run (100 MB of
> > > dirty pages / 3 MB/sec).
> > > 
> > > This optimization takes a different approach to ensuring a
> > > minimum of two iterations. Rather than waiting for dirty-sync-count
> > > to increment, directly look for an indication that the source VM
> > > has dirtied RAM that has already been transferred.
> > > 
> > > On the source VM a magic marker is written just after the 3 MB
> > > offset. The destination VM is now monitored to detect when the
> > > magic marker is transferred. This gives a guarantee that the
> > > first 3 MB of memory have been transferred. Now the source VM
> > > memory is monitored at exactly the 3MB offset until we observe
> > > a flip in its value. This gives us a guarantee that the guest
> > > workload has dirtied a byte that has already been transferred.
> > > 
> > > Since we're looking at a place that is only 3 MB from the start
> > > of memory, with the 3 MB/sec bandwidth, this test should complete
> > > in 1 second, instead of 30 seconds.
> > > 
> > > Once we've proved there is some dirty memory, migration can be
> > > set back to full speed for the remainder of the 1st iteration,
> > > and the entirety of the second iteration, at which point migration
> > > should be complete.
> > > 
> > > On a test machine this further reduces the migration test time
> > > from 8 minutes to 1 minute 40.
> > 
> > The outcome is definitely nice, but it does look slightly hacky to me and
> > makes the test slightly more complicated.
> > 
> > If it's all about making sure we finish the 1st iteration, can we simply
> > add a src qemu parameter "switchover-hold"?  If it's set, src never
> > switches over to dst but keeps iterating.
> 
> For *most* of the tests, we want to ensure there are a minimum
> of 2 iterations. For the XBZRLE test we want to ensure there are
> a minimum of 3 iterations, so the cache gets a workout.
> 
> > Then migrate_ensure_non_converge() will be as simple as setting
> > switchover-hold to true.
> > 
> > I am even wondering whether there could be a real-life use case for that,
> > e.g., where a user might want to pre-heat a migration of some VM and
> > trigger the switchover only when the admin really wants it (the pre-heat
> > moves most of the pages and keeps doing so).
> > 
> > It'll also be similar to what Avihai proposed here on switchover-ack, just
> > with the ack mechanism on the src side:
> > 
> > https://lore.kernel.org/r/20230530144821.1557-3-avihaih@nvidia.com
> 
> In general I strongly wanted to avoid adding special logic to the
> migration code that makes it directly synchronize with the test
> suite, because once we do that I don't think the test suite is
> providing coverage of the real world usage scenario.

The problem is that non-live migration is already not real world usage in
most cases.  It seems we all agreed that the tests are there to cover code
paths, not real world usage, or maybe not?

> 
> IOW, if we add a switchover-ack feature, we should certainly have
> *a* test that uses it, but we should not modify other tests because
> we want to exercise the logic as it would run with apps that don't
> rely on switchover-ack.
> 
> Also, this slow migration test is incredibly painful for people right
> now, so I'd like to see us get a speed up committed to git quickly.
> I don't want to block it on a feature proposal that might yet take
> months to merge.

That'll be trivial, afaict.

I just worry that this patch will bring complexity to the test cases,
that's another burden we need to carry besides QEMU itself.
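
For reference, the two-step check from the quoted patch description can be
modelled roughly as below. This is a toy simulation only, not the qtest code:
ToyMigration, MAGIC, the marker position (one byte past the 3 MB offset) and
the 64 KiB chunk size are all assumptions made up for illustration.

```python
MB = 1024 * 1024
MARKER_OFFSET = 3 * MB   # byte the guest workload keeps flipping
MAGIC = 0xAB             # hypothetical marker value, placed one byte later


class ToyMigration:
    """Toy model: RAM is copied front to back while the guest dirties it."""

    def __init__(self, ram_size):
        assert ram_size > MARKER_OFFSET + 1
        self.src = bytearray(ram_size)
        self.dst = bytearray(ram_size)
        self.sent = 0                        # bytes copied in the first pass
        self.src[MARKER_OFFSET + 1] = MAGIC  # marker just after the offset

    def guest_tick(self):
        # The guest workload flips the byte at exactly the 3 MB offset.
        self.src[MARKER_OFFSET] ^= 1

    def transfer(self, nbytes):
        # Copy the next chunk of source RAM to the destination.
        end = min(self.sent + nbytes, len(self.src))
        self.dst[self.sent:end] = self.src[self.sent:end]
        self.sent = end


def wait_for_dirty_after_transfer(mig, chunk=64 * 1024):
    # Step 1: poll the destination until the marker shows up, which proves
    # that everything before it has been transferred.
    while mig.dst[MARKER_OFFSET + 1] != MAGIC:
        mig.guest_tick()
        mig.transfer(chunk)
    # Step 2: poll the source at the 3 MB offset until its value flips, which
    # proves the guest dirtied a byte that was already sent.
    seen = mig.src[MARKER_OFFSET]
    while mig.src[MARKER_OFFSET] == seen:
        mig.guest_tick()
    return mig.sent
```

Calling wait_for_dirty_after_transfer(ToyMigration(8 * MB)) returns once only
a little over 3 MB has been copied, which is where the "1 second at 3 MB/sec"
estimate in the patch description comes from.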

If you want, I can try to prepare such a patch before this weekend, and if
it turns out to be complicated enough to take more than next week to
review, feel free to go ahead with this one.

I understand the migration test issue has been there for a long time.  But
to me it still matters which approach is cleaner for the long term.
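
As a rough illustration of the "switchover-hold" idea discussed above, here is
a toy model, not QEMU code. ToyPrecopy, migrate_with_min_iterations and the
byte-based downtime budget are hypothetical names and simplifications; only
the parameter name "switchover-hold" comes from this thread, and it is a
proposal here, not an existing option.

```python
class ToyPrecopy:
    """Toy pre-copy loop with a hypothetical switchover-hold flag."""

    def __init__(self, downtime_budget_bytes, switchover_hold=False):
        self.budget = downtime_budget_bytes
        self.switchover_hold = switchover_hold
        self.iterations = 0
        self.completed = False

    def iterate(self, remaining_dirty_bytes):
        # One pre-copy pass. Switch over only if the remaining dirty data
        # fits in the downtime budget AND the hold flag is not set.
        if self.completed:
            raise RuntimeError("migration already completed")
        self.iterations += 1
        converged = remaining_dirty_bytes <= self.budget
        if converged and not self.switchover_hold:
            self.completed = True
        return self.completed


def migrate_with_min_iterations(mig, dirty_trace, min_iterations):
    """Hold switchover until at least min_iterations passes have run.

    dirty_trace yields the dirty bytes left after each pass. Returns the
    number of passes run; the caller checks mig.completed for success.
    """
    for remaining in dirty_trace:
        # Keep the hold set for every pass before the min_iterations-th one.
        mig.switchover_hold = mig.iterations + 1 < min_iterations
        if mig.iterate(remaining):
            break
    return mig.iterations
```

With min_iterations=2 this models what most tests want, and min_iterations=3
models the XBZRLE case mentioned in the quoted reply. The test would then need
no bandwidth or downtime tuning to guarantee non-convergence.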

-- 
Peter Xu



