qemu-block

Re: Question regarding live-migration with drive-mirror


From: Fiona Ebner
Subject: Re: Question regarding live-migration with drive-mirror
Date: Thu, 29 Sep 2022 11:39:34 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.13.0

On 28.09.22 at 20:53, Dr. David Alan Gilbert wrote:
> * Fiona Ebner (f.ebner@proxmox.com) wrote:
>> Hi,
>> recently one of our users provided a backtrace[0] for the following
>> assertion failure during a live migration that uses drive-mirror to sync
>> a local disk:
>>> bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed
>>
>> The way we do migration with a local disk is essentially:
>> 1. start target instance with a suitable NBD export
>> 2. start drive-mirror on the source side and wait for it to become
>> ready once
>> 3. issue 'migrate' QMP command
>> 4. cancel drive-mirror blockjob after the migration has finished
>>
>> I reproduced the issue with the following fio script running in the
>> guest (to dirty lots of clusters):
>>> fio --name=make-mirror-work --size=100M --direct=1 --rw=randwrite \
>>>     --bs=4k --ioengine=psync --numjobs=5 --runtime=60 --time_based
>>
>> AFAIU, the issue is that nothing guarantees that the drive mirror is
>> ready when the migration inactivates the block drives.
> 
> I don't know the block code well enough; I don't think I'd realised
> that a drive-mirror could become unready.

I actually shouldn't have used "ready" here, because "ready" just means
that the job is ready to be completed, and indeed it will stay "ready".
But with the default copy-mode=background, new guest writes do mean that
there can be left-over work lying around. Completing/canceling the job
will do that work, but currently, migration doesn't do that automatically.

> 
>> Is using copy-mode=write-blocking for drive-mirror the only way to avoid
>> this issue? There, the downside is that the network (used by the mirror)
>> would become a bottleneck for IO in the guest, while the behavior would
>> really only be needed during the final phase.
> 
> It sounds like you need a way to switch to the blocking mode.

Yes, that would help. I guess it would be:
1. wait for the drive-mirror(s) to become ready
2. switch to blocking mode
3. wait for the drive-mirror(s) to not have any background work left;
i.e. ensure that from now on we're always in sync
4. start state migration

Not sure if step 3 can be achieved currently. The BlockJobInfo object
has a "busy" field, but I guess it's possible to have background work
left even if there's no pending IO. At least the comment about draining
below sounds like that could happen.
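
To make the concern about step 3 concrete, here is a rough sketch of the
kind of check a management layer could run on the result of
'query-block-jobs'. It is only a heuristic: the function names are made
up for illustration, the sample device name is hypothetical, and treating
"offset == len" as "nothing known to be left" is an assumption. With
copy-mode=background a guest write can dirty the disk again right after
the query, so a True result is no guarantee.

```python
from typing import Any, Dict, List


def mirror_looks_idle(job: Dict[str, Any]) -> bool:
    """Heuristic: True if a BlockJobInfo dict shows no known pending work.

    Assumption (not a guarantee): a ready mirror job that is not busy and
    has offset == len had nothing left to copy at the time of the query.
    """
    return (
        job.get("type") == "mirror"
        and job.get("ready") is True
        and job.get("busy") is False
        and job.get("offset") == job.get("len")
    )


def all_mirrors_look_idle(jobs: List[Dict[str, Any]]) -> bool:
    """Apply the heuristic to every mirror job in a query-block-jobs result."""
    mirrors = [j for j in jobs if j.get("type") == "mirror"]
    # No mirror jobs at all is treated as "not in the expected state".
    return bool(mirrors) and all(mirror_looks_idle(j) for j in mirrors)
```

Even when this returns True, the answer can be stale by the time the
state migration starts, which is why it cannot replace a blocking copy
mode or a cancel/complete while the guest is paused.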

Might still not be perfect, because migration with a lot of RAM (or slow
network) can take a while, so the guest IO would still be bottlenecked
during that period. But I guess at /some/ point it has to be ;)

> 
>> I guess the assert should be avoided in any case. Here's a few ideas
>> that came to mind:
>> 1. migration should fail gracefully
>> 2. migration should wait for the mirror-jobs to become ready before
>> inactivating the block drives - that would increase the downtime in
>> these situations of course
>> 2A. additionally, drive-mirror could be taken into account when
>> converging the migration somehow?
> 
> Does the migration capability 'pause-before-switchover' help you here?
> If enabled, it causes the VM to pause just before the
> bdrv_inactivate_all (and then use migrate-continue to tell it to carry
> on)
> 
> Dave
> 

Thank you for the suggestion! Using the capability and canceling the
block job before issuing 'migrate-continue' is an alternative. I'm just
a bit worried about the longer downtime, but maybe it's not too bad.
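
For reference, a minimal sketch of the QMP command sequence this
alternative would need, written as Python helpers that only build the
command dictionaries. The helper names, the migration URI and the job ID
in the usage note are placeholders, and sending the commands, polling
'query-migrate' and waiting for the block job to be gone are left out.

```python
from typing import Any, Dict, List

QmpCommand = Dict[str, Any]


def commands_before_switchover(uri: str) -> List[QmpCommand]:
    """Commands to issue once the drive-mirror job(s) have become ready."""
    return [
        {
            "execute": "migrate-set-capabilities",
            "arguments": {
                "capabilities": [
                    {"capability": "pause-before-switchover", "state": True}
                ]
            },
        },
        {"execute": "migrate", "arguments": {"uri": uri}},
    ]


def commands_at_pre_switchover(job_ids: List[str]) -> List[QmpCommand]:
    """Commands to issue once query-migrate reports status 'pre-switchover'.

    The guest is paused at this point, so no new writes can dirty the disk.
    The caller should wait until each cancelled job has actually finished
    before sending the final migrate-continue.
    """
    cmds: List[QmpCommand] = [
        {"execute": "block-job-cancel", "arguments": {"device": job_id}}
        for job_id in job_ids
    ]
    cmds.append(
        {"execute": "migrate-continue", "arguments": {"state": "pre-switchover"}}
    )
    return cmds
```

For example, commands_before_switchover("tcp:192.0.2.10:60000") followed
later by commands_at_pre_switchover(["mirror-scsi0"]) yields the four
commands in the order they would be sent. The extra downtime is the time
between the pause and the completion of the last cancelled job.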

Best Regards,
Fiona

>> I noticed the following comment in the mirror implementation
>>>         /* Note that even when no rate limit is applied we need to yield
>>>          * periodically with no pending I/O so that bdrv_drain_all() returns.
>>>          * We do so every BLKOCK_JOB_SLICE_TIME nanoseconds, or when there is
>>>          * an error, or when the source is clean, whichever comes first. */




