qemu-devel

Re: Reducing vdpa migration downtime because of memory pin / maps


From: Si-Wei Liu
Subject: Re: Reducing vdpa migration downtime because of memory pin / maps
Date: Mon, 26 Jun 2023 23:36:15 -0700
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.12.0



On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:
On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:

On 6/7/23 01:08, Eugenio Perez Martin wrote:
On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
Sorry for reviving this old thread; I missed the best timing to follow up
on it while I was on vacation. I have been working on this and found
some discrepancies, please see below.

On 4/5/23 04:37, Eugenio Perez Martin wrote:
Hi!

As mentioned in the last upstream virtio-networking meeting, one of
the factors that adds more downtime to migration is the handling of
the guest memory (pin, map, etc). At this moment this handling is
bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
destination device waits until all the guest memory / state is
migrated to start pinning all the memory.

The proposal is to bind it to the char device life cycle (open vs
close),
Hmmm, really? If it's bound to the char device life cycle, that won't
work for the next guest / qemu launched on the same vhost-vdpa device node.

Maybe my sentence was not accurate, but I think we're on the same page here.

Two qemu instances opening the same char device at the same time are
not allowed, and vhost_vdpa_release cleans up all the maps. So the next
qemu that opens the char device should see a clean device anyway.
I mean the pinning can't be done at the time of the char device open,
where the user address space is not known/bound yet. The earliest
possible point for pinning would be after the vhost_attach_mm() call
from SET_OWNER is done.
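
In terms of the actual uAPI calls, the ordering I'm referring to is
roughly the following (minimal sketch; the device node path is just an
example):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

int main(void)
{
    /* Opening the char device binds no user address space yet. */
    int fd = open("/dev/vhost-vdpa-0", O_RDWR);   /* example node */
    if (fd < 0)
        return 1;

    /* VHOST_SET_OWNER is what ends up in vhost_attach_mm(); only from
     * this point on does the driver know which mm it could pin. */
    if (ioctl(fd, VHOST_SET_OWNER) < 0)
        return 1;

    return 0;
}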
Maybe we are deviating, let me start again.

Using QEMU code, what I'm proposing is to modify the lifecycle of the
.listener member of struct vhost_vdpa.

At this moment, the memory listener is registered at
vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
and is unregistered in both vhost_vdpa_reset_status and
vhost_vdpa_cleanup.

My original proposal was just to move the memory listener registration
to the last vhost_vdpa_init, and remove the unregister from
vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
be the same, so the device should not notice this change.
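
Roughly, the change I have in mind looks like the sketch below. This is
not a real patch: the elided code and the "last vhost_dev" check are
simplified placeholders; only memory_listener_register()/unregister()
and the .listener member are the actual primitives involved.

/* Register the listener when the device is initialized instead of when
 * it is started, so guest memory is mapped (and pinned) early. */
static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
{
    struct vhost_vdpa *v = opaque;

    /* ... existing init work ... */

    if (vhost_vdpa_last_dev(dev)) {  /* placeholder for the last-dev check */
        memory_listener_register(&v->listener, &address_space_memory);
    }
    return 0;
}

static void vhost_vdpa_reset_status(struct vhost_dev *dev)
{
    /* ... reset the device status as today ... */

    /* No memory_listener_unregister() here anymore: the maps (and the
     * pins behind them) survive a virtio reset, and the listener is only
     * unregistered in vhost_vdpa_cleanup(). */
}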
This can address LM downtime latency for sure, but it won't help downtime during a dynamic SVQ switch - which still needs to go through the full unmap/map cycle (including the slow pinning part) when moving from passthrough to SVQ mode. Note that not every device can work with a separate ASID for SVQ descriptors. The fix is expected to work on normal vDPA vendor devices without a separate descriptor ASID, with a platform IOMMU underneath or with an on-chip IOMMU.


One of the concerns was that it could delay VM initialization, and I
didn't profile it but I think that may be the case.
Yes, that's the concern here - we should not introduce a regression in the normal VM boot process/time. In the case of a large VM it's very easy to see the side effect if we go this way.

  I'm not sure about
the right fix but I think the change is easy to profile. If that is
the case, we could:
* use a flag (listener->address_space ?) and only register the
listener in _init if waiting for a migration, do it in _start
otherwise.
Just doing this alone won't help the SVQ mode switch downtime; see the reason stated above.

* something like io_uring, where the maps can be done in parallel and
we can fail _start if any of them fails (see the sketch after this
exchange).
This can alleviate the problem somewhat, but it is still sub-optimal and does not scale with larger memory sizes. I'd like zero or minimal pinning to be done at SVQ switch or migration time.
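
For reference, the kind of parallel map flow I understand you are
suggesting is sketched below; do_dma_map() and struct vdpa_region are
made-up stand-ins for the real per-region map path, and a real
implementation (io_uring based or otherwise) would of course live
inside qemu or the kernel rather than in a helper like this.

#include <pthread.h>
#include <stddef.h>

struct vdpa_region;                                     /* hypothetical region descriptor */
extern int do_dma_map(const struct vdpa_region *reg);   /* stand-in for the map call */

struct map_job {
    const struct vdpa_region *reg;
    int ret;
};

static void *map_worker(void *opaque)
{
    struct map_job *job = opaque;

    job->ret = do_dma_map(job->reg);            /* each region mapped independently */
    return NULL;
}

/* Map all regions in parallel; fail the whole _start if any map fails. */
static int map_all_parallel(struct map_job *jobs, size_t n)
{
    pthread_t threads[n];
    int ret = 0;

    for (size_t i = 0; i < n; i++) {
        pthread_create(&threads[i], NULL, map_worker, &jobs[i]);
    }
    for (size_t i = 0; i < n; i++) {
        pthread_join(threads[i], NULL);
        if (jobs[i].ret < 0 && ret == 0) {
            ret = jobs[i].ret;                  /* remember the first failure */
        }
    }
    return ret;                                 /* caller fails _start on error */
}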


Actually, I think that the counterpart vhost_detach_mm() only getting
handled in vhost_vdpa_release() at device close time is an artifact and
shortcoming of today's vhost protocol - the opposite RESET_OWNER call is
not made mandatory, hence it is only implemented in the vhost-net device
today. One qemu instance could well exec(3) another new qemu instance to
live upgrade itself while keeping all emulated devices and the guest
alive. The current vhost design simply prohibits this from happening.

Ok, I was not aware of this. Thanks for explaining it!

    so all the guest memory can be pinned for the whole guest / qemu
lifecycle.
I think tying pinning to the guest / qemu process life cycle makes more
sense. Essentially this pinning part needs to be decoupled from the
iotlb mapping abstraction layer, and can / should work as a standalone
uAPI, such that QEMU at the destination may launch and pin all of the
guest's memory as needed without having to start the device, while
awaiting any incoming migration request. The problem, though, is that
there's no existing vhost uAPI that could properly serve as the vehicle
for that. SET_OWNER / SET_MEM_TABLE / RESET_OWNER seem a remote fit.
Any objection against introducing a new but clean vhost uAPI for pinning
guest pages, subject to the guest's life cycle?
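
To sketch the shape such a uAPI could take (VHOST_VDPA_PIN_MEM and
struct vhost_vdpa_pin_range are made-up names purely for illustration,
not an existing or already-proposed interface; only the 0xAF vhost ioctl
magic is real):

#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical uAPI: pin a range of the caller's address space for later
 * DMA use, tied to the owner mm rather than to DRIVER_OK / RESET. */
struct vhost_vdpa_pin_range {
    uint64_t uaddr;   /* HVA of the guest memory region */
    uint64_t size;    /* length in bytes */
};

#define VHOST_VDPA_PIN_MEM  _IOW(0xAF, 0x80, struct vhost_vdpa_pin_range) /* number made up */

/* The destination qemu could pin all guest RAM right after SET_OWNER,
 * while still waiting for the incoming migration stream. */
static int pin_guest_memory(int vdpa_fd, uint64_t uaddr, uint64_t size)
{
    struct vhost_vdpa_pin_range range = {
        .uaddr = uaddr,
        .size  = size,
    };

    return ioctl(vdpa_fd, VHOST_VDPA_PIN_MEM, &range);
}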

I think that whether or not to pin memory maps should be a kernel
decision, not something driven by qemu.
It's a kernel decision for sure. I'm with you on this part.

I'm not against it if needed, but
let me know if the current "clean at close" behavior addresses your concerns.
To better facilitate the QEMU exec (live update) case, I propose we add a new
vhost uAPI pair for explicit pinning requests - which would live with the
user mm's, or more precisely the qemu instance's, lifecycle.

Ok, I see your problem better now, but I think it should be solved at
the kernel level. Does that live update need to forcefully unpin and pin
the memory again,
No, it should avoid the unpin & pin cycle, otherwise it would defeat the downtime expectation. The exec(3)'d process should inherit the page pinning and/or mlock accounting from the original QEMU process, while keeping the original page pinning intact. Physical page mappings for DMA can be kept as-is to avoid the need to reprogram the device, though in that case the existing vhost iotlb entries should be updated to reflect the new HVAs in the exec(3)'d QEMU process.

  or is that just a consequence of how the memory listener works right
now?

Why not extend RESET_OWNER to the rest of the devices? It seems the
most natural way to me.
Not sure; I think RESET_OWNER might be too heavyweight for implementing live update, and people are not clear about its exact semantics (which part of the device state is being reset, and how much)... In addition, the people working on iommufd intended to make this a "one-shot" ioctl, e.g. CHANGE_OWNER instead of RESET_OWNER+SET_OWNER:

https://lore.kernel.org/kvm/Y5Ibvv9PNMifi0NF@ziepe.ca/

A new uAPI to just change ownership of the mm seems a better fit to me...
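
Roughly, the live update flow with such a one-shot ioctl could look like
the sketch below; VHOST_VDPA_CHANGE_OWNER is a made-up name following the
idea in that thread, and the HVA refresh in step 3 is likewise only an
illustration of the "update the iotlb entries to the new HVAs" point.

#include <sys/ioctl.h>

#define VHOST_VDPA_CHANGE_OWNER  _IO(0xAF, 0x81)   /* hypothetical one-shot ioctl */

/*
 * Hypothetical live-update flow, keeping pins and device mappings intact:
 *
 *   1. The old qemu exec(3)s the new qemu binary, passing the open
 *      vhost-vdpa fd across exec (no close, so vhost_vdpa_release()
 *      never runs and nothing gets unpinned).
 *   2. The new qemu maps guest RAM (possibly at different HVAs) and
 *      issues the one-shot owner change instead of RESET_OWNER +
 *      SET_OWNER.
 *   3. The new qemu re-issues IOTLB updates so the kernel's iotlb
 *      entries point at the new HVAs, without unpinning pages or
 *      reprogramming the device.
 */
static int take_over_vdpa(int vdpa_fd)
{
    if (ioctl(vdpa_fd, VHOST_VDPA_CHANGE_OWNER) < 0) {
        return -1;
    }

    /* Step 3 would follow here: walk the memory map and refresh HVAs. */
    return 0;
}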

Thanks,
-Siwei


Thanks!


Another concern is the use_va stuff: originally it is tagged at the
device level and made static at the time of device instantiation, which
is fine. But the ones to come are finding a new home in per-group or
per-vq level structs. It's hard to tell whether pinning is actually
needed for those latter use_va friends, as they are essentially tied to
the virtio life cycle or feature negotiation, while the guest / QEMU
starts way earlier than that. Perhaps just ignore those sub-device level
use_va usages? Presumably !use_va at the device level is sufficient to
infer the need of pinning for the device?

I don't follow this. But I have the feeling that the subject of my
original mail would be way more accurate if I had said just "memory
maps".
I think the iotlb layer in vhost-vdpa just provides the abstraction for
mapping, not pinning. Although in some cases mapping implicitly relies
on pinning for DMA purposes, it doesn't have to be tied to that in the
uAPI semantics. We can do explicit on-demand pinning for cases such as
warming up the device at the live migration destination.
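
As a rough kernel-side sketch of that decoupling (the helper name and
bookkeeping are made up; pin_user_pages_fast()/unpin_user_pages() are
the real GUP APIs), the "warm-up" pin would only take the page
references, without touching the iotlb or the IOMMU yet:

#include <linux/mm.h>
#include <linux/slab.h>

/* Hypothetical pre-pin helper: pin the pages up front, tied to the owner
 * mm; the later map step reuses them and only programs the IOMMU or
 * on-chip translation, so no second pin is needed. */
static int vhost_vdpa_prepin_range(unsigned long uaddr, int npages,
                                   struct page ***pagesp)
{
    struct page **pages;
    int pinned;

    pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return -ENOMEM;

    pinned = pin_user_pages_fast(uaddr, npages,
                                 FOLL_WRITE | FOLL_LONGTERM, pages);
    if (pinned != npages) {
        if (pinned > 0)
            unpin_user_pages(pages, pinned);
        kvfree(pages);
        return pinned < 0 ? pinned : -EFAULT;
    }

    *pagesp = pages;   /* handed over to the later map step */
    return 0;
}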


I still think the way to fix it is to actually delegate that to the
kernel vdpa, so it can choose whether a particular ASID needs the pin or
not. But let me know if I missed something.
You can disregard this for now. I will discuss that further with you
guys once the bind_mm and per-group use_va stuff has landed.

Thanks!
-Siwei



Thanks!

Regards,
-Siwei


This has two main problems:
* At this moment the reset semantics force the vdpa device to unmap
all the memory, so this change needs a vhost-vdpa feature flag (a
possible negotiation check is sketched after this list).
* This may increase the initialization time. Maybe we can delay it if
qemu is not the destination of a LM. Anyway, I think this should be
done as an optimization on top.
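
For the first problem, the negotiation could look roughly like the
sketch below; VHOST_BACKEND_F_MAP_PERSIST is a made-up name and bit,
only meant to illustrate the kind of flag ("mappings survive a virtio
reset") that would need to be negotiated through the usual backend
features path.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical backend feature bit: the device keeps its IOTLB mappings
 * across a virtio reset instead of unmapping everything. Name and bit
 * number are invented for this sketch. */
#define VHOST_BACKEND_F_MAP_PERSIST  63

static bool vhost_vdpa_maps_persist(uint64_t backend_features)
{
    return backend_features & (1ULL << VHOST_BACKEND_F_MAP_PERSIST);
}

/* qemu would only skip the unmap-on-reset behavior (and keep the memory
 * listener registered across reset) when the kernel advertises the bit,
 * falling back to today's behavior on older kernels. */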

Any ideas or comments in this regard?

Thanks!




