

From: Juan Quintela
Subject: Re: [RFC 7/7] migration: call qemu_savevm_state_pending_exact() with the guest stopped
Date: Thu, 13 Oct 2022 18:08:56 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)

Joao Martins <joao.m.martins@oracle.com> wrote:
> +Avihai, +Jason
>
> On 03/10/2022 04:16, Juan Quintela wrote:
>> HACK ahead.
>> 
>> There are devices that require the guest to be stopped to tell us what
>> is the size of its state. 
>
> "... and have transferred said device state" if we are talking current vfio.

Yeap.

> We can't query the size of the data_fd right now

Oops.  My understanding was that once the guest is stopped you can tell
how big it is.  That is a deal breaker.  We can't continue if we don't
know the size, and copying the whole state just to learn it looks like
too much.

> It would need a @data_size in addition to @data_fd in
> vfio_device_feature_mig_state, or getting fseek supported over the fd

Ok, we can decide how to do it, but I think that putting fseek into it
will only make things more complicated.
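
Purely as an illustration of the @data_size option (this is not the
current UAPI: device_state and data_fd exist today, data_size does not),
something like the following sketch would let userspace learn the size
without draining data_fd first:

    struct vfio_device_feature_mig_state {
            __u32 device_state;      /* VFIO_DEVICE_STATE_* */
            __s32 data_fd;           /* stream to read the device state from */
            __aligned_u64 data_size; /* hypothetical: size of the state behind data_fd */
    };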

>> So we need to stop the vm "before" we
>> call the functions.
>> 
>> It is a hack because:
>> - we are "starting" the guest again to stop it in migration_complete()
>>   I know, I know, but it is not trivial to get all the information
>>   easily to migration_complete(), so this hack.
>> 
>
> Could you expand on that? Naively, I was assuming that by 'all information'
> you meant the locally stored counters in migration_iteration_run() that
> aren't present in MigrationState?

This is not related to vfio at all.

The problem is that current code assumes that the guest is still
running, and then do qemu_mutex_lock_iothread() and unlock() inside the
pending handlers.  To stop the guest, we need to lock the iothread
first.  So this was going to hang.  I fixed it doing the lock/unlock
twice.  That is the hack.  I will change the callers once that we decide
what is the right path ahead.  This is not a problem related to vfio. it
is a problem related about how we can stop/resume guests programatically
in the middle of qemu code.
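
To make the deadlock shape concrete, a minimal sketch (the handler name
and signature here are only illustrative, and it assumes QEMU's internal
headers; the point is that today's pending handlers take the iothread
lock themselves, and that lock is not recursive):

    static void example_state_pending(void *opaque, uint64_t *pend_pre,
                                      uint64_t *pend_post)
    {
        qemu_mutex_lock_iothread();   /* hangs if the caller already holds it */
        /* ... compute the pending sizes ... */
        qemu_mutex_unlock_iothread();
    }

So the caller below has to drop the lock before calling
qemu_savevm_state_pending_exact() and take it again afterwards to
restore the run state, which is exactly the lock/unlock-twice hack.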

>> - auto_converge test fails with this hack.  I think that it is related
>>   to the previous problem.  We start the guest when it is supposed to be
>>   stopped for convergence reasons.
>> 
>> - All experiments that I did to do the proper thing failed with having
>>   the iothread_locked() or try to unlock() it when not locked.
>> 
>> - several of the pending functions are using the iothread lock
>>   themselves, so I need to split it to have two versions (one for the
>>   _estimate() case with the iothread lock), and another for the
>>   _exact() case without the iothread_lock().  I want comments about
>>   this approach before I try to continue on this direction.
>> 
>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>> ---
>>  migration/migration.c        | 13 +++++++++++++
>>  tests/qtest/migration-test.c |  3 ++-
>>  2 files changed, 15 insertions(+), 1 deletion(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 35e512887a..7374884818 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -3742,7 +3742,20 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>      trace_migrate_pending_estimate(pending_size, s->threshold_size, pend_pre, pend_post);
>>  
>>      if (pend_pre <= s->threshold_size) {
>> +        int old_state = s->state;
>> +        qemu_mutex_lock_iothread();
>> +        // is this really necessary?  it works for me both ways.
>> +        qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
>> +        s->vm_was_running = runstate_is_running();
>> +        vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>> +        qemu_mutex_unlock_iothread();
>>          qemu_savevm_state_pending_exact(&pend_pre, &pend_post);
>> +        qemu_mutex_lock_iothread();
>> +        runstate_set(old_state);
>> +        if (s->vm_was_running) {
>> +            vm_start();
>> +        }
>> +        qemu_mutex_unlock_iothread();
>
> Couldn't we just have an extra patch that just stores pend_pre and
> pending_size in MigrationState, which would allow this whole check to be
> moved into migration_completion()? Or maybe that wasn't an option for some
> other reason?

This is not an option, because we don't have a way to get back from
migration_completion() to migration_iteration_run() if there is not
enough space to send all the state.

> Additionally, what about having a migration helper function that the
> vfio_save_complete_precopy() callback uses to check whether the expected
> device state size meets the threshold/downtime as it is saving the device
> state, and otherwise fail earlier when saving beyond the threshold?

That is what I tried to do with the code.
See patch 4, ram.c.

Now I have two functions:
- ram_state_pending_estimate()
- ram_state_pending_exact()

The first should work without any locking, i.e. give a best estimate
without too much work; the second should give the real value.

What we had discussed before for vfio is that the device will "cache"
the value returned from the previous ram_state_pending_exact() call.
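
Just to sketch that idea (the handler names, signatures, the cached
field and vfio_query_state_size() are assumptions based on this RFC,
not existing code):

    static void vfio_state_pending_estimate(void *opaque, uint64_t *pend_pre,
                                            uint64_t *pend_post)
    {
        VFIODevice *vbasedev = opaque;
        VFIOMigration *migration = vbasedev->migration;

        /* cheap path: reuse whatever the last _exact() call learned */
        *pend_pre += migration->cached_pending_bytes;
    }

    static void vfio_state_pending_exact(void *opaque, uint64_t *pend_pre,
                                         uint64_t *pend_post)
    {
        VFIODevice *vbasedev = opaque;
        VFIOMigration *migration = vbasedev->migration;

        /* expensive path: query the device with the guest stopped, and
         * cache the result for later _estimate() calls */
        vfio_query_state_size(vbasedev, &migration->cached_pending_bytes);
        *pend_pre += migration->cached_pending_bytes;
    }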


> It would allow supporting both the (current UAPI) case where you need to
> transfer the state to get the device state size (so constantly checking
> pending_pre against threshold_size would allow us not to violate the SLA),
> as well as any other UAPI improvement to fseek()/data_size that lets you
> fail even earlier.
>
> Seems like at least it keeps some of the rules (both under the iothread
> lock), as this patch does.

In the worst case, you need to read the state into a local buffer and
then calculate the size in _exact()?  That looks like overkill, but in
your case I can't see another way to do it.
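
For reference, that worst case would look roughly like this (plain
sketch, the helper name is made up): drain data_fd just to learn the
length, which means copying the whole state once before the real
transfer:

    #include <errno.h>
    #include <unistd.h>

    /* Read data_fd to EOF just to learn the state size.  The data would
     * have to be buffered or replayed later, so this is a full extra copy. */
    static ssize_t measure_state_size(int data_fd)
    {
        char buf[64 * 1024];
        ssize_t n, total = 0;

        while ((n = read(data_fd, buf, sizeof(buf))) > 0) {
            total += n;
        }
        return n < 0 ? -errno : total;
    }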

My understanding was that for new hardware you have an easy way to
calculate this value "if the guest was stopped".

My next series is a way to do the read/send of the state in parallel
with the migration_thread(), but that is a different problem.  I need a
way to calculate sizes to start with, otherwise I have no way to decide
whether I can enter the completion stage (or for old devices one can lie
and assume that the size is zero, but then you are going to have bigger
downtimes).

Later, Juan.



