Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)


From: Maciej S. Szmigiero
Subject: Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
Date: Tue, 22 Sep 2020 00:22:12 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.12.0

Hi David,

Thank you for your comments.

First, I want to underline that this driver targets Windows guests,
where the ability to modify and adapt the guest memory management
code is extremely limited.

While it does work with Linux guests, too, this is definitely not its
native environment.

It also has to support rather big guests (up to 1 TB of RAM), so
performance matters.

Further answers are below.

On 21.09.2020 11:10, David Hildenbrand wrote:
> On 20.09.20 15:25, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This series adds a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
>> and its protocol definitions.
>> Also included is a driver providing backing devices for memory hot-add
>> protocols ("haprots").
>>
>> A haprot device works like a virtual DIMM stick: it allows inserting
>> extra RAM into the guest at run time.
>>
>> The main differences from the ACPI-based PC DIMM hotplug are:
>> * Notifying the guest about the new memory range is not done via ACPI but
>> via a protocol handler that registers with the haprot framework.
>> This means that the ACPI DIMM slot limit does not apply.
>>
>> * A protocol handler can prevent removal of a haprot device when it is
>> still in use by setting its "busy" field.
>>
>> * A protocol handler can also register an "unplug" callback so it gets
>> notified when a user decides to remove the haprot device.
>> This way the protocol handler can inform the guest about this fact and / or
>> do its own cleanup.
>>
>> The hv-balloon driver is like virtio-balloon on steroids: it allows both
>> changing the guest memory allocation via ballooning and inserting extra
>> RAM into it by adding haprot virtual DIMM sticks.
>> One of the advantages of these over ACPI-based PC DIMM hotplug is that
>> such memory can be hotplugged at a much smaller granularity because the
>> ACPI DIMM slot limit does not apply.
> 
> Reading further below, it's essentially DIMM-based memory hotplug +
> virtio-balloon - except the 256MB DIMM limit. But reading below, I don't
> see how you want to avoid the KVM memory slot limit that's of a similar
> size (I recall 256*2 due to 2 address spaces).

The idea is to use virtual DIMM sticks for hot-adding extra memory at
runtime, while using ballooning for runtime adjustment of the guest
memory size within the current maximum.

When the guest is rebooted, the virtual DIMM configuration is adjusted
by the software controlling QEMU (some DIMMs are removed and / or some
are added) to give the guest the same effective memory size as it had
before the reboot.

So, yes, it will be a problem if the user expands their running guest
~256 times, each time making it even bigger than before, without
rebooting it even once, but this does seem to be an edge case.

In the future it would be better to automatically turn the current
effective guest size into the boot memory size when the VM restarts
(the VM would then have no virtual DIMMs inserted after a reboot), but
doing this requires quite a few changes to QEMU, which is why it isn't
there yet.

The above is basically how the Hyper-V hypervisor handles its memory
size changes, and it seems to be as close to a transparently resizable
guest as is reasonably possible.
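
To illustrate, here is a rough Python sketch of mine (with hypothetical
names; this logic would live in the software controlling QEMU, not in
QEMU itself) of rebuilding the virtual DIMM set on reboot:

GiB = 1024 ** 3

def dimms_for_reboot(boot_mem, effective_size, dimm_granularity=8 * GiB):
    # Return the virtual DIMM sizes to plug after a guest reboot so that
    # boot memory plus the new DIMM set equals the previous effective size.
    assert effective_size >= boot_mem
    hotplug = effective_size - boot_mem
    dimms = []
    while hotplug >= dimm_granularity:
        dimms.append(dimm_granularity)
        hotplug -= dimm_granularity
    if hotplug:
        dimms.append(hotplug)  # remainder as one smaller DIMM
    return dimms

# Guest booted with 4 GiB and grew to an effective 13 GiB:
print([d // GiB for d in dimms_for_reboot(4 * GiB, 13 * GiB)])  # [8, 1]

Folding many small hot-adds into a few larger DIMMs at reboot like this
is also what keeps the KVM memory slot usage bounded over time.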


> Or avoid VMA limits when wanting to grow a VM big in very tiny steps over
> time (e.g., adding 64MB at a time).

Not sure if you are talking about VMA limits inside the host or the guest.

>>
>> In contrast with ACPI DIMM hotplug, where one can only request to unplug
>> a whole DIMM stick, this driver allows removing memory from the guest in
>> single page (4k) units via ballooning.
>> Then, once the guest has released the whole memory backed by a haprot
>> virtual DIMM stick, such a device is marked "unused" and can be removed
>> from the VM if one wants to.
>> A "HV_BALLOON_HAPROT_UNUSED" QMP event is emitted in this case so the
>> software controlling QEMU knows that this operation is now possible.
>>
>> The haprot devices are also marked unused after a VM reboot (with a
>> corresponding "HV_BALLOON_HAPROT_UNUSED" QMP event).
>> They are automatically reinserted (if still present) after the guest
>> reconnects to this protocol (a "HV_BALLOON_HAPROT_INUSE" QMP event is then
>> emitted).
>>
>> For performance reasons, the guest-released memory is tracked in a few
>> range trees, as a series of (start, count) ranges.
>> Each time a new page range is inserted into such tree its neighbors are
>> checked as candidates for possible merging with it.
>>
>> Besides performance reasons, the Dynamic Memory protocol itself uses page
>> ranges as the data structure in its messages, so relevant pages need to be
>> merged into such ranges anyway.
>>
>> One has to be careful when tracking the guest-released pages, since the
>> guest can maliciously report returning pages outside its current address
>> space, which could later clash with the address range of newly added
>> memory.
>> Similarly, the guest can report freeing the same page twice.
>>
>> The above design results in much better ballooning performance than when
>> using virtio-balloon with the same guest: 230 GB / minute with this driver
>> versus 70 GB / minute with virtio-balloon.
> 
> I assume these numbers apply with Windows guests only. IIRC Linux
> hv_balloon does not support page migration/compaction, while
> virtio-balloon does. So you might end up with quite some fragmented
> memory with hv_balloon in Linux guests - of course, usually only in
> corner cases.

As I previously mentioned, this driver targets mainly Windows guests.

And Windows seems to be rather determined to free the requested number
of pages: waiting for the guest to reply to a 2GB balloon request
sometimes takes 2-3 seconds.
So I guess it performs some kind of memory compaction while processing
the request.
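
Coming back to the range tracking mentioned in the cover letter above,
the idea is roughly the following (a simplified Python sketch of mine;
the actual driver keeps several such trees, in C):

import bisect

class ReleasedRanges:
    # Guest-released pages as sorted, disjoint (start, count) ranges.
    def __init__(self):
        self.starts = []  # sorted range start page numbers
        self.counts = []  # parallel list of range lengths

    def insert(self, start, count):
        # Insert a released range, merging with adjacent neighbors.
        # Overlaps (double-freed or malicious ranges) are rejected.
        i = bisect.bisect_right(self.starts, start)
        if i > 0 and self.starts[i - 1] + self.counts[i - 1] > start:
            return False  # overlaps the preceding range
        if i < len(self.starts) and start + count > self.starts[i]:
            return False  # overlaps the following range
        if i < len(self.starts) and start + count == self.starts[i]:
            count += self.counts[i]  # merge with the following range
            del self.starts[i]
            del self.counts[i]
        if i > 0 and self.starts[i - 1] + self.counts[i - 1] == start:
            self.counts[i - 1] += count  # merge with the preceding range
        else:
            self.starts.insert(i, start)
            self.counts.insert(i, count)
        return True

This is also why a fully released 1 TB area costs just a few bytes to
track: it collapses into a single (start, count) pair.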

>>
>> During a ballooning operation most of the time is spent waiting for the
>> guest to come up with newly freed page ranges; processing the received
>> ranges on the host side (in QEMU / KVM) is nearly instantaneous.
>>
>> The unballoon operation is also pretty much instantaneous:
>> thanks to the merging of the ballooned out page ranges 200 GB of memory can
>> be returned to the guest in about 1 second.
>> With virtio-balloon this operation takes about 2.5 minutes.
>>
>> These tests were done against a Windows Server 2019 guest running on a
>> Xeon E5-2699, after dirtying the whole memory inside guest before each
>> balloon operation.
>>
>> Using a range tree instead of a bitmap to track the removed memory also
>> means that the solution scales well with the guest size: even a 1 TB range
>> takes just a few bytes of memory.
>>
>> Example usage:
>> * Add "-device vmbus-bridge,id=vmbus-bridge -device hv-balloon,id=hvb"
>>   to the QEMU command line and set "maxmem" value to something large,
>>   like 1T.
>>
>> * Use QEMU monitor commands to add a haprot virtual DIMM stick, together
>>   with its memory backend:
>>   object_add memory-backend-ram,id=mem1,size=200G
>>   device_add mem-haprot,id=ha1,memdev=mem1
>>   The first command is actually the same as for ACPI-based DIMM hotplug.
>>
>> * Use the ballooning interface monitor commands to force the guest to give
>>   out as much memory as possible:
>>   balloon 1
> 
> At least under virtio-balloon with Linux, that will pretty sure trigger
> a guest crash. Is something like that expected to work with Windows
> guests reasonably well?

Windows will generally leave some memory free when processing balloon
requests, although the precise amount varies from a few hundred MB to
1+ GB.

Usually it runs stably even with just a few hundred MB of free memory
remaining, but I have seen occasional crashes at shutdown time in this
case (probably something critical failing to initialize due to the
system running out of memory).

While the above command was just a quick example, I personally think
it is the guest that should enforce a balloon floor, since it is the
guest, not the host, that knows its internal memory requirements.

For this reason the hv_balloon client driver inside the Linux kernel
implements its own, rough balloon floor - see compute_balloon_floor().
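
For illustration, the general shape of such a floor is a piecewise-linear
function of the total memory size, keeping proportionally less free as
the guest grows (the constants below are made up for this sketch; see
compute_balloon_floor() in drivers/hv/hv_balloon.c for the real ones):

def balloon_floor_mb(total_mb):
    # Minimum amount of memory (in MB) to leave to the guest.
    if total_mb < 512:
        return 64 + total_mb // 4    # small guests keep a larger share
    elif total_mb < 8192:
        return 128 + total_mb // 8
    else:
        return 512 + total_mb // 32  # huge guests keep a small fraction

for total_mb in (256, 2048, 65536):
    print(total_mb, "->", balloon_floor_mb(total_mb))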

On the other hand, one can also argue that the user wish should be
respected as much as possible.

>>   The ballooning interface monitor commands can also be used to resize
>>   the guest up and down appropriately.
>>
>> * One can check the current guest size by issuing a "info balloon" command.
>>   This is useful to know what is happening, since large ballooning or
>>   unballooning operations take some time to complete.
> 
> So, every time you want to add more memory (after the balloon was
> deflated) to a guest, you have to plug a new mem-haprot device, correct?

Yes.

> So your QEMU user has to be well aware of how to balance "balloon" and
> "object_add/device_add/object_del_device_del" commands to achieve the
> desired guest size.

In this case the VM user does not interact directly with the QEMU process.

Rather, the user tells the software controlling QEMU (think: libvirt)
how large they want the guest to be, and this software then does
whatever is necessary to reach that target and make it persistent
across guest reboots.

>>
>> * Once the guest releases the whole memory backed by a haprot device
>>   (or is restarted) a "HV_BALLOON_HAPROT_UNUSED" QMP event will be
>>   generated.
>>   The haprot device then can be removed, together with its memory backend:
>>   device_del ha1
>>   object_del mem1
> 
> So, you rely on some external entity to properly shrink a guest again
> (e.g., during reboot).

Yes.
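
The flow such an external entity could implement looks roughly like the
Python sketch below (mine, not an existing tool; it speaks raw QMP over
a Unix socket, and note that the object-add argument layout varies
between QEMU versions):

import json, socket

class Qmp:
    def __init__(self, path):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(path)
        self.file = self.sock.makefile("rw")
        json.loads(self.file.readline())  # consume the QMP greeting
        self.cmd("qmp_capabilities")

    def cmd(self, name, args=None):
        msg = {"execute": name}
        if args:
            msg["arguments"] = args
        self.file.write(json.dumps(msg) + "\n")
        self.file.flush()
        while True:
            reply = json.loads(self.file.readline())
            if "event" not in reply:  # skip interleaved async events
                return reply

def grow_guest(qmp, idx, extra_bytes):
    # Plug one more haprot virtual DIMM backed by a RAM memory backend.
    qmp.cmd("object-add", {"qom-type": "memory-backend-ram",
                           "id": "mem%d" % idx,
                           "props": {"size": extra_bytes}})
    qmp.cmd("device_add", {"driver": "mem-haprot",
                           "id": "ha%d" % idx,
                           "memdev": "mem%d" % idx})

def shrink_guest(qmp, target_bytes):
    # Ask the guest to balloon down; once some haprot DIMM becomes fully
    # released, the HV_BALLOON_HAPROT_UNUSED event tells the controller
    # it can be removed with device_del / object-del.
    qmp.cmd("balloon", {"value": target_bytes})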

>>
>> Future directions:
>> * Allow sharing the ballooning QEMU interface between hv-balloon and
>>   virtio-balloon drivers.
>>   Currently, only one of them can be added to the VM at the same time.
> 
> Yeah, that makes sense. Only one at a time.

Having only one *active* at a time makes sense, but ultimately it would
be nice to be able to have both inserted into a VM: one for Windows
guests and one for Linux ones, with only the one matching the running
guest actually active.

>>
>> * Allow new haprot devices to reuse the same address range as the ones
>>   that were previously deleted via device_del monitor command without
>>   having to restart the VM.
>>
>> * Add vmstate / live migration support to the hv-balloon driver.
>>
>> * Use haprot device to also add memory via virtio interface (this requires
>>   defining a new operation in virtio-balloon protocol and appropriate
>>   support from the client virtio-balloon driver in the Linux kernel).
> 
> Most probably not the direction we are going to take. We have virtio-mem
> for clean, fine-grained, NUMA-aware, paravirtualized memory hot(un)plug
> now, and we are well aware of various issues with (base-page size based)
> memory ballooning that are fairly impossible to solve (especially in the
> context of vfio).
> 

Thanks,
Maciej


